The harder part (rather than training models on all that sequence data) might be connecting all the relevant experimental data and metadata so that the resulting model does something useful. But in principle, I'm all for training on all of SRA, for example.
I can agree on pre-training “as we know it”, but to say data is a fossil fuel is absurd. We have yet to even begin to look at physiological/proprioceptive data for a single individual.
We’re probably already at the point of diminishing returns for raw sequence data.