It is interesting that scaling to billions of parameters only really seems to help with structure prediction and that this does not translate to transfer learning for property/function prediction
Interesting, but there are some points to note. Embedding compression is not usually how pLMs are used for variant effect prediction - see e.g. ESM-1v. The ESM-1v model was optimised for DMS prediction, so it's no surprise it does well. Scaling should consider data and compute, as in some earlier work this year.
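To illustrate the distinction: ESM-1v-style zero-shot variant scoring uses the model's own token probabilities (mask the mutated position, compare the log-probabilities of mutant vs wild-type residue) rather than compressing embeddings into features. A minimal sketch, with made-up per-position log-probabilities standing in for the pLM's output:

```python
import math

# Hypothetical per-position amino-acid log-probabilities, as a pLM would emit
# after masking each position in turn (positions and numbers are made up).
log_probs = {
    (0, "A"): math.log(0.6), (0, "V"): math.log(0.1),
    (1, "K"): math.log(0.5), (1, "E"): math.log(0.3),
}

def masked_marginal_score(wt, mut, pos, log_probs):
    # ESM-1v-style zero-shot score: log p(mut) - log p(wt) at the masked position.
    # Negative means the model finds the mutation less likely than wild type.
    return log_probs[(pos, mut)] - log_probs[(pos, wt)]

score = masked_marginal_score("A", "V", 0, log_probs)
```

No supervised head or embedding regression is involved, which is why DMS benchmarks of this scoring mode say little about embedding-based transfer learning.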
This last point is corroborated by work by Wu et al earlier this year, who found that embeddings derived from larger PLMs are more accurate when fine-tuned on structural-similarity searches. https://doi.org/10.1101/2024.05.14.594226 /end
Could these models be running up against the technical error of what they are trying to predict? The biological ground truth in these phenotypic assays is very noisy, much noisier than structure afaik. And no model can accurately predict an assay beyond the technical accuracy of the assay...
I wonder this too. On ProteinGym you can see that different methods have highly correlated Spearman values (e.g. all methods are terrible at MK01 mutants), and it could be assay related
Yes, most prediction methods are sensitive to the properties/functions that evolution cares about (which may even vary across homologs), and that's not always what an assay is set up to measure. (Plus many other reasons why the correlations may be low)
Important distinction that may or may not affect results: the paper from the first post uses lasso regression, whereas a lot of other papers use ridge regression. No idea whether this plays a role in performance (in my hands it never has)
I would expect them to perform similarly. Main advantage of lasso is we get an estimate of the size of the relevant latent space (by counting non-zero params).
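The sparsity point is easy to see in a toy setting: with an orthonormal design matrix, both penalties have closed-form solutions, so the difference reduces to how each shrinks the OLS coefficients. A sketch with made-up coefficients (the numbers are illustrative only):

```python
# Closed-form shrinkage under an orthonormal design: ridge rescales every
# coefficient, lasso soft-thresholds and zeroes out the small ones.
ols = [2.0, -0.3, 0.05, 1.2, -0.02, 0.4]  # hypothetical OLS coefficients
lam = 0.1

def ridge_shrink(b, lam):
    # Ridge shrinks toward zero but never reaches exactly zero.
    return [bi / (1 + lam) for bi in b]

def lasso_shrink(b, lam):
    # Lasso sets any coefficient with |b| <= lam to exactly zero.
    return [max(abs(bi) - lam, 0.0) * (1 if bi > 0 else -1) for bi in b]

nonzero_lasso = sum(1 for c in lasso_shrink(ols, lam) if c != 0.0)
nonzero_ridge = sum(1 for c in ridge_shrink(ols, lam) if c != 0.0)
```

Counting `nonzero_lasso` is the latent-dimension estimate mentioned above; ridge keeps every coefficient nonzero, so it can't provide one.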
A paper earlier this year suggests it's just from a greater capacity to memorize domain-specific contacts. Very little emergence seems to be happening, except in some easy-to-fold hyperstable miniproteins https://www.pnas.org/doi/10.1073/pnas.2406285121
I would rephrase that as "sequence-only pLMs do not benefit from scale beyond 650M params **yet**"
We have seen ample evidence of improvements from scaling in other domains. Perhaps it's about what we scale + what objective, and we haven't yet found it?
But this is changing, too, right? From experimentalists I sense growing excitement for ultra-high-throughput assays based on sequencing, so maybe it should read "likely not large enough to benefit from larger models **yet**"?
In my experience, which is 1000x less than Prof Wilke's, this high-throughput future has been "just around the corner" for decades, and so it makes sense to look for methods that suit today's data constraints. But TBH if you have enough data, why use PLMs at all?
Yes. At a minimum, there will be many small to medium sized datasets for years to come. Even if some labs can go bigger for some applications, this won't be across the board.
I've seen very nice papers on scaling laws, but they are based on metrics like training loss and perplexity. I guess the problem is that we ultimately would need to go to the lab to properly compare sequence quality.
Also, the fact that mean pooling works so well (compared to other methods) indicates to me that we are definitely not getting the most out of these models (compared to what we know about how protein sequence relates to protein function/properties)
Completely agree. My theory is that mean-pooling performance is overstated bc these benchmark sets consist of sequences of the same length; signal from important residues gets equally diluted by mean-pooling. My guess is mean-pooling performs worse in variable-length cases such as the OpenCRISPR library
Please note that mean pooling performed worse in DMS data where all proteins are of the same length (left) and best in a diverse set of proteins with widely differing lengths (right). I think what matters more is the number of differences among sequences.
Yes. And mean pooling might also remove a substantial part of the signal that comes from the actual sequence beyond composition (though of course not everything)