🧪🧬🖥️ Nucleotide Transformer is now published in Nature Methods!
Foundation models for genomics, with up to 2.5 billion parameters, trained on genomes from 800+ species and 3,000+ human individuals.
📄 https://www.nature.com/articles/s41592-024-02523-z
💻 https://github.com/instadeepai/nucleotide-transformer
#Genomics #BioML #MLSky
Comments
🚀 Highlights of our journey so far:
✅ 700,000+ downloads
✅ 120+ citations
code: https://github.com/instadeepai/nucleotide-transformer
tutorial: https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-65099cdde13ff96230f2e592
"The models are trained on sequences of length up to 1000 tokens...The tokenizer starts tokenizing from left to right by grouping the letters "A", "C", "G" and "T" in 6-mers. "
https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb
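A minimal sketch of that 6-mer tokenization and embedding extraction, assuming the standard AutoTokenizer/AutoModelForMaskedLM loading path used in the linked notebook; the checkpoint id below is one of the released Hugging Face variants and is an assumption, not necessarily the exact one used there:

```python
# Sketch: tokenize a DNA sequence into 6-mer tokens and pull embeddings
# from a Nucleotide Transformer checkpoint (checkpoint id assumed).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# With 6 bp per token and a 1000-token context, sequences of up to ~6 kb fit in one pass.
sequence = "ATGCATGCATGCATGCATGCATGC"
inputs = tokenizer(sequence, return_tensors="pt")

# Inspect the 6-mer tokens, e.g. ['<CLS>', 'ATGCAT', 'GCATGC', ...]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

# Extract per-token embeddings from the last hidden layer.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # shape: (batch, num_tokens, hidden_dim)
print(embeddings.shape)
```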
If sequence length is the limitation, did you look at hybrid Mamba+attention models?
We didn't look at the Mamba architecture for version 1 of NT, but we are investigating longer-context models now.
https://arxiv.org/abs/2306.15794