adamauton.bsky.social
Geneticist @ 23andMe
So given that, can you spot the error right at the start of this Wikipedia article on HERC2?
"HERC2 is a giant E3 ubiquitin protein ligase, implicated in DNA repair regulation, pigmentation and neurological disorders."
🙋‍♂️
We demand examples!
Thank you! Super useful.
A super fun project. Congrats to Suyash Shringarpure, Wei Wang, Sotiris Karagounis, Xin Wang, Anna Reisetter, and Aly Khan on getting this out the door. Feedback very welcome!
www.medrxiv.org/content/10.1...
Definitely a fun result that I would not have expected! It's very early days, but it is exciting to think how LLM approaches could be combined with approaches that rely on functional data to incorporate prior knowledge of biology. Could we get the best of both worlds?
Interestingly, you can probe the internal embeddings of these models, and we found that the causal genes tend to be 'proximal' to the phenotypes that they influence in embedding space. So they do seem to be learning some relationship between these two concepts.
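For anyone wondering what "proximal in embedding space" could look like in practice, here is a minimal sketch. It assumes cosine similarity as the distance measure and uses made-up three-dimensional vectors with placeholder gene names; the thread does not say which metric or embeddings were actually used, so treat every name and number below as hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_genes_by_proximity(phenotype_vec, gene_vecs):
    """Sort genes from most to least similar to the phenotype vector."""
    scores = {
        gene: cosine_similarity(phenotype_vec, vec)
        for gene, vec in gene_vecs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors for illustration only (assumption: real embeddings would
# come from the model and have hundreds or thousands of dimensions).
phenotype = [1.0, 0.0, 0.0]
genes = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.0, 1.0, 0.0],
    "GENE_C": [0.5, 0.5, 0.0],
}

ranking = rank_genes_by_proximity(phenotype, genes)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

In this toy setup the gene whose vector points most nearly in the same direction as the phenotype vector ranks first, which is the kind of pattern the post describes for causal genes.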
Nonetheless, the LLMs also have biases; they tend to favor genes with lots of existing literature, which perhaps isn't surprising given how they're trained. They also struggle to identify causal genes in loci containing large numbers of genes.
We found this really surprising, and worried that the LLMs had been trained on the truth data. To guard against that, we tried to use benchmark data that was curated after the LLM training period, as well as a benchmark dataset not available on the internet.
The answer appears to be yes! In fact, using a *really simple* approach, LLMs appear to outperform state-of-the-art methods at identifying the causal gene in a variety of 'gold standard' truth datasets!
However, LLMs have been trained on virtually all biological literature on the internet. We wondered: could they help us address the causal gene problem?
As I once heard David Altshuler say, "There is always a gene at the top of the list, and there is always a postdoc who can tell you why that gene must be important!" (It may be a misquote; it was a long time ago!)
However, humans bring their own biases to this problem; they tend to over-focus on genes they know about and are able to construct narratives as to why those genes *must be* the causal ones!
As good as these methods are, I have often seen research scientists relying on biological knowledge or literature when reviewing GWAS loci, and they can often identify very plausible causal genes, even in the absence of a link from functional data.
Multiple methods have been developed to identify the causal gene at GWAS loci. These methods often integrate additional functional information, or make use of colocalization with expression or proteomics QTL datasets.
GWAS have identified thousands of genetic variants associated with human disease. However, as the majority of the identified loci are found in non-coding regions, a long-standing problem in human genetics has been how to identify the genes driving these associations.