🧪🧬🖥️ Nucleotide Transformer is now published in Nature Methods!
Foundation models for genomics, with up to 2.5 billion parameters, trained on genomes from 800+ species and 3,000+ human individuals.
📄 https://www.nature.com/articles/s41592-024-02523-z
💻 https://github.com/instadeepai/nucleotide-transformer
#Genomics #BioML #MLSky
Comments
🚀 Highlights of our journey so far:
✅ 700,000+ downloads
✅ 120+ citations
code: https://github.com/instadeepai/nucleotide-transformer
tutorial: https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-65099cdde13ff96230f2e592
"The models are trained on sequences of length up to 1000 tokens...The tokenizer starts tokenizing from left to right by grouping the letters "A", "C", "G" and "T" in 6-mers. "
https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb
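A minimal sketch of that 6-mer tokenization and embedding extraction, assuming the standard AutoTokenizer/AutoModelForMaskedLM loading path used in the linked notebook; the checkpoint id below is one of the released Hugging Face variants and is an assumption, not necessarily the exact one used there:

```python
# Sketch: tokenize a DNA sequence into 6-mer tokens and pull embeddings
# from a Nucleotide Transformer checkpoint (checkpoint id assumed).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# With 6 bp per token and a 1000-token context, sequences of up to ~6 kb fit in one pass.
sequence = "ATGCATGCATGCATGCATGCATGC"
inputs = tokenizer(sequence, return_tensors="pt")

# Inspect the 6-mer tokens, e.g. ['<CLS>', 'ATGCAT', 'GCATGC', ...]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

# Extract per-token embeddings from the last hidden layer.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # shape: (batch, num_tokens, hidden_dim)
print(embeddings.shape)
```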
If sequence length is the limitation, did you look at hybrid Mamba+attention models?
We didn't look at the Mamba architecture for version 1 of NT, but we are investigating longer-context models now.
https://arxiv.org/abs/2306.15794