Yet more evidence that transfer learning of sequence-only PLMs does not benefit from scale beyond 650M params 🧵 - ThreadSky

ddelalamo.bsky.social • 94 days ago

Yet more evidence that transfer learning of sequence-only PLMs does not benefit from scale beyond 650M params 🧵

Comments

We also saw this when we looked at transfer learning with PoET embeddings compared with ESM - https://www.openprotein.ai/poet-foundation-model-for-high-accuracy-protein-property-prediction

tbepler.bsky.social•94 days ago

It also appears to be true for unsupervised variant effect prediction that increasing model size hurts performance

tbepler.bsky.social•94 days ago

It is interesting that scaling to billions of parameters only really seems to help with structure prediction and that this does not translate to transfer learning for property/function prediction

smyth7.bsky.social•94 days ago

Interesting, but there are some points to note. Embedding compression is not usually how pLMs are used for variant effect prediction; see e.g. ESM1v. The ESM1v model was optimised for DMS prediction, so it's no surprise it does well. Scaling should consider data and compute, as in some earlier work this year.
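
For readers unfamiliar with the alternative mentioned here: ESM1v-style zero-shot scoring typically compares the model's probability for the mutant residue against the wild-type residue at a masked position, rather than regressing on embeddings. A minimal sketch of that log-odds arithmetic follows. The probability table, sequence, and function name are made up for illustration; no real model is called.

```python
import math

def masked_marginal_score(position_probs, wild_type, mutations):
    """Sum of log p(mutant) - log p(wild type) over the mutated positions.

    position_probs: hypothetical stand-in for a masked language model's output,
    mapping a 0-based position to a dict of amino acid -> probability.
    """
    score = 0.0
    for pos, mutant_aa in mutations:
        probs = position_probs[pos]
        score += math.log(probs[mutant_aa]) - math.log(probs[wild_type[pos]])
    return score

# Toy inputs (assumptions, not real model output)
wild_type = "MKV"
position_probs = {1: {"K": 0.6, "R": 0.3, "D": 0.1}}

conservative = masked_marginal_score(position_probs, wild_type, [(1, "R")])
disruptive = masked_marginal_score(position_probs, wild_type, [(1, "D")])
print(conservative, disruptive)
```

A more negative score means the model considers the substitution less likely than the wild-type residue in that context.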

ddelalamo.bsky.social•94 days ago

Authors of the ProGen 2 paper similarly observed this (https://doi.org/10.1016/j.cels.2023.10.002), as did Li et al using ESM-2 as well as CARP (https://doi.org/10.1101/2024.02.05.578959). They also discuss the exception of tertiary structure, where larger models do in fact offer some benefit.

ddelalamo.bsky.social•94 days ago

This last point is corroborated by work by Wu et al earlier this year, who found that embeddings derived from larger PLMs are more accurate when fine-tuned on structural-similarity searches. https://doi.org/10.1101/2024.05.14.594226 /end

ddelalamo.bsky.social•94 days ago

Here's a blurb with more details https://publish.obsidian.md/ddelalamo/Sorted_notes/Public/Protein_design/Protein+property+prediction+using+PLMs+does+not+benefit+from+scale+except+when+predicting+structural+features

acritschristoph.bsky.social•92 days ago

Could these models be running up against the technical error of what they are trying to predict? The biological ground truth in these phenotypic assays is very noisy, much noisier than structure afaik. And no model can accurately predict an assay beyond the technical accuracy of the assay...
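
The ceiling described here can be illustrated with a small simulation. This is a sketch under stated assumptions: a simulated "true" effect with unit variance, independent Gaussian measurement noise of equal variance, and Pearson rather than Spearman correlation. Under those assumptions, even an oracle that knows the true effect correlates with one noisy replicate at only about the square root of the replicate-to-replicate correlation.

```python
import random
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(1)
n = 20000
noise_sd = 1.0  # assumed assay noise, equal to the signal's standard deviation

true_effect = [random.gauss(0, 1) for _ in range(n)]
replicate_1 = [t + random.gauss(0, noise_sd) for t in true_effect]
replicate_2 = [t + random.gauss(0, noise_sd) for t in true_effect]

# Replicate agreement estimates the assay's reliability (about 0.5 here)
reliability = pearson(replicate_1, replicate_2)
# A perfect predictor scored against a single noisy replicate
oracle_score = pearson(true_effect, replicate_1)

print(round(reliability, 3), round(oracle_score, 3), round(reliability ** 0.5, 3))
```

With real assays the noise is rarely this well behaved, so the square-root relationship should be read as a rough guide, not a hard limit.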

ddelalamo.bsky.social•92 days ago

I wonder this too. On ProteinGym you can see that different methods have highly correlated Spearman values (e.g. all methods are terrible at MK01 mutants), and it could be assay related
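
One way to check this kind of claim is to take each method's per-assay Spearman score and then correlate those score lists across methods. A minimal sketch is below, using a hand-rolled Spearman correlation (average ranks for ties). The per-assay numbers are hypothetical placeholders, not ProteinGym values.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-assay Spearman scores for two prediction methods
method_a = [0.52, 0.41, 0.10, 0.63, 0.35]
method_b = [0.44, 0.47, 0.08, 0.60, 0.30]

across_assays = spearman(method_a, method_b)
print(across_assays)
```

A high value across many assays would be consistent with assay-level effects (such as noise or a mismatch between the assay and what evolution selects for) dominating over differences between methods.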

lindorfflarsen.bsky.social•91 days ago

Yes, most prediction methods are sensitive to the properties/functions that evolution cares about (which may even vary across homologs) and that’s not always what an assay is set up to measure. (Plus many other reasons why the correlations may be low)

ddelalamo.bsky.social•94 days ago

Important distinction that may or may not affect results: the paper from the first post uses lasso regression, whereas a lot of other papers use ridge regression. No idea whether this plays a role in performance (in my hands it never has)

clauswilke.com•94 days ago

I would expect them to perform similarly. Main advantage of lasso is we get an estimate of the size of the relevant latent space (by counting non-zero params).
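
The difference in sparsity can be seen on toy data. The sketch below is not the pipeline from any of the papers discussed: it uses a hand-rolled coordinate descent, synthetic Gaussian features standing in for embedding dimensions, and an arbitrarily chosen penalty strength. Only two of the eight features carry signal.

```python
import random

def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def coordinate_descent(X, y, alpha, penalty, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 plus alpha*||w||_1 (lasso) or (alpha/2)*||w||^2 (ridge)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual, adding back j's own contribution
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            if penalty == "lasso":
                new_w = soft_threshold(rho, alpha) / col_sq[j]
            else:
                new_w = rho / (col_sq[j] + alpha)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w

# Synthetic data (assumption): 8 features, only features 0 and 3 matter
random.seed(0)
n, p = 200, 8
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[3] + random.gauss(0, 0.1) for row in X]

lasso_w = coordinate_descent(X, y, alpha=0.1, penalty="lasso")
ridge_w = coordinate_descent(X, y, alpha=0.1, penalty="ridge")

lasso_nonzero = sum(1 for c in lasso_w if c != 0.0)
ridge_nonzero = sum(1 for c in ridge_w if c != 0.0)
print(lasso_nonzero, ridge_nonzero)
```

Lasso sets the irrelevant coefficients to exactly zero, so counting the survivors gives a rough size for the relevant subspace. Ridge only shrinks coefficients, so every feature keeps a small non-zero weight.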

markburgessosl.bsky.social•94 days ago

Interesting. Can this be associated with a known emergent scale in protein structure / information?

ddelalamo.bsky.social•94 days ago

A paper earlier this year suggests it's just from a greater capacity to memorize domain-specific contacts. Very little emergence seems to be happening, except in some easy-to-fold hyperstable miniproteins https://www.pnas.org/doi/10.1073/pnas.2406285121

chaitjo.bsky.social•94 days ago

I would rephrase "sequence-only PLMs does not benefit from scale beyond 650M params **yet**" 😉

We have seen ample evidence of improvements in scaling in other domains, perhaps it's about what we scale + what objective, and we haven't yet found it?

clauswilke.com•94 days ago

Our point is that most biological datasets people may generate in a lab are likely not large enough to benefit from larger models.

chaitjo.bsky.social•94 days ago

(Now only for debate/argument's sake)

But this is changing, too, right? From experimentalists I feel there's growing excitement for ultra high throughput assays based on sequencing, so maybe the "likely not large enough to benefit from larger models **yet**"?

ddelalamo.bsky.social•94 days ago

In my experience, which is 1000x less than Prof Wilke's, this high-throughput future has been "just around the corner" for decades, and so it makes sense to look for methods that suit today's data constraints. But TBH if you have enough data, why use PLMs at all?

clauswilke.com•94 days ago

Yes. At a minimum, there will be many small to medium sized datasets for years to come. Even if some labs can go bigger for some applications, this won't be across the board.

noeliaferruz.bsky.social•92 days ago

What are your thoughts on autoregressive models (where the task is design)? GPT-3 was better than GPT-2 and so on.

ddelalamo.bsky.social•91 days ago

Truthfully I almost never use them! Except a few. But if you have insights into their performance I’m all ears as it’s a blind spot for me

noeliaferruz.bsky.social•91 days ago

I’ve seen very nice papers on scaling laws but they are based on metrics like training loss and perplexity. I guess the problem is that we ultimately would need to go to the lab to properly compare sequence quality.

lindorfflarsen.bsky.social•94 days ago

Also, the fact that mean pooling works so well (compared to other methods) indicates to me that we are definitely not getting the most out of these models (compared to what we know about how protein sequence relates to protein function/properties)

ddelalamo.bsky.social•93 days ago

Completely agree. My theory is that mean-pooling performance is overstated bc these benchmark sets consist of sequences with the same length; signal from important residues gets equally diluted by mean-pooling. My guess is mean-pooling performs worse in variable-length cases such as the OpenCRISPR library

clauswilke.com•92 days ago

Please note that mean pooling performed worse in DMS data where all proteins are of the same length (left) and best in a diverse set of proteins with widely differing lengths (right). I think what matters more is the number of differences among sequences.
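
The dependence on the number of differing positions can be sketched with toy arithmetic. The example below makes a strong simplifying assumption: changing a residue alters only that residue's embedding vector, whereas a real PLM would also shift the embeddings of neighbouring positions. The 2-dimensional vectors and helper names are invented for illustration.

```python
def mean_pool(residue_embeddings):
    """Average L per-residue vectors of dimension d into a single d-vector."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[k] for vec in residue_embeddings) / length for k in range(dim)]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def toy_protein(length, changed_positions):
    # Unchanged residues embed to [0, 0]; changed residues embed to [1, 1]
    return [[1.0, 1.0] if i in changed_positions else [0.0, 0.0] for i in range(length)]

wild_type = mean_pool(toy_protein(100, set()))
single_mutant = mean_pool(toy_protein(100, {10}))
multi_mutant = mean_pool(toy_protein(100, {10, 20, 30, 40, 50}))

single_shift = l1_distance(wild_type, single_mutant)  # one change out of 100 residues
multi_shift = l1_distance(wild_type, multi_mutant)    # five changes out of 100 residues
print(single_shift, multi_shift)

# Pooling always yields a fixed-size vector, whatever the sequence length
short_protein = mean_pool(toy_protein(10, {3}))
print(len(short_protein))
```

In this toy setting the pooled vectors separate in proportion to the fraction of positions that differ, which is one possible reading of why single-substitution DMS variants are harder to tell apart after pooling than diverse natural sequences.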

ddelalamo.bsky.social•92 days ago

Oh fascinating. There goes that theory

lindorfflarsen.bsky.social•93 days ago

Yes. And mean pooling might also remove a substantial part of the signal that comes from the actual sequence beyond composition (though of course not everything)

hans2sachs.bsky.social•94 days ago

Similar findings in this one https://doi.org/10.1101/2024.09.23.614603. There are many more influences at play than mere scaling of parameters...
