mechanicaldirk.bsky.social
Training big models at @ai2.bsky.social.
49 posts 501 followers 241 following

This project is a perfect model of an OLMo contribution. Well scoped, practical, sound theoretical underpinnings, and @lambdaviking.bsky.social submitted the paper 24h before the deadline 😍. It's integrated into the OLMo trainer here: github.com/allenai/OLMo...

Finally, OLMo 1B. This is the most commonly requested OLMo feature, and it's finally here.

I'm in Singapore for @iclr-conf.bsky.social ! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too

Came across arxiv.org/pdf/2504.05058 today. What a cool example of work you can do when LLM training data is open!

Ever wonder how LLM developers choose their pretraining data? It’s not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵

Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting them to their training data. We do this at unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
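For intuition, here is a toy sketch of the underlying idea of surfacing verbatim overlaps between a model's output and training text. This is not OLMoTrace's actual index (which works over trillions of tokens in seconds); the corpus, span lengths, and helper function below are made up for illustration.

```python
# Toy sketch: find spans of a model's output that occur verbatim in training text.
# A real system needs a pre-built index over trillions of tokens; here we simply
# brute-force n-grams over a tiny made-up "corpus".
def verbatim_spans(output, corpus, min_n=3, max_n=6):
    corpus_tokens = corpus.split()
    out_tokens = output.split()
    # Pre-index all corpus n-grams for the span lengths we care about.
    corpus_ngrams = {
        n: {tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)}
        for n in range(min_n, max_n + 1)
    }
    matches = []
    for n in range(max_n, min_n - 1, -1):          # longest spans first
        for i in range(len(out_tokens) - n + 1):
            span = tuple(out_tokens[i:i + n])
            if span in corpus_ngrams[n]:
                matches.append(" ".join(span))
    return matches

corpus = "the quick brown fox jumps over the lazy dog while the slow dog watches"
output = "a quick brown fox jumps high over the lazy dog"
print(verbatim_spans(output, corpus))
# ['quick brown fox jumps', 'over the lazy dog', ...]
```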

The fact that my Bsky feed is all tariffs and no Llama 4 means the platform is pretty much cooked for research purposes.

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
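A toy illustration of what "superword" tokens mean in practice, using the Hugging Face tokenizers library. This is not the SuperBPE training recipe, just a sketch of how BPE merges can cross word boundaries once whitespace pre-tokenization is relaxed; the corpus and vocabulary size are made up.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = ["by the way, by the way, by the way"] * 1000  # tiny synthetic corpus

# Baseline BPE: whitespace pre-tokenization keeps merges inside single words.
bpe = Tokenizer(models.BPE())
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train_from_iterator(corpus, trainers.BpeTrainer(vocab_size=300))

# "Superword"-style: no whitespace boundary, so merges may absorb spaces
# and produce tokens that span several words.
superword = Tokenizer(models.BPE())
superword.train_from_iterator(corpus, trainers.BpeTrainer(vocab_size=300))

print(bpe.encode("by the way").tokens)        # e.g. ['by', 'the', 'way']
print(superword.encode("by the way").tokens)  # merged tokens can contain spaces, e.g. ['by the way']
```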

Error bars! @hails.computer will be so proud!

Introducing olmOCR, our open-source tool to extract clean plain text from PDFs! Built for scale, olmOCR handles many document types with high throughput. Run it on your own GPU for free: at over 3,000 tokens/s, that's equivalent to $190 per million pages, or 1/32 the cost of GPT-4o!
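A rough back-of-envelope showing how throughput translates into cost per page. The GPU price and tokens-per-page figures below are assumptions for illustration, not numbers from the olmOCR release, but they land in the same ballpark as the quoted $190 per million pages.

```python
# Back-of-envelope: throughput -> cost per million pages.
tokens_per_second = 3000      # throughput quoted above
tokens_per_page = 1000        # ASSUMPTION: average output tokens per PDF page
gpu_cost_per_hour = 2.00      # ASSUMPTION: hourly rental price of one GPU, in USD

pages_per_hour = tokens_per_second * 3600 / tokens_per_page
cost_per_million_pages = gpu_cost_per_hour / pages_per_hour * 1e6
print(f"~${cost_per_million_pages:,.0f} per million pages")  # ~$185 under these assumptions
```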

We took our most efficient model and made an open-source iOS app 📱. But why? As phones get faster, more AI will happen on device. With OLMoE, researchers, developers, and users can get a feel for this future: fully private LLMs, available anytime. Learn more from @soldaini.net 👇 youtu.be/rEK_FZE5rqQ

14.8T tokens in 2.8M GPU-hours is about 1,500 tokens per second per GPU. That's a very good number for 37B active parameters, but by no means unbelievable.
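The arithmetic, for anyone who wants to check it:

```python
# Sanity check: total tokens divided by total GPU-seconds gives per-GPU throughput.
tokens = 14.8e12      # total pretraining tokens
gpu_hours = 2.8e6     # total GPU-hours
print(f"{tokens / (gpu_hours * 3600):,.0f} tokens per GPU-second")  # ~1,468
```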

Behind the scenes with what it's like to build language models and pursue (hopefully) cutting-edge AI research. Interviewing the OLMo 2 leads: open secrets of training language models, what we have learned, and what we are going to do next. YouTube: https://buff.ly/40IlSFF Podcast / notes:

In November, every post here was about NLP. Now it's all about TikTok. We're doing the Twitter speed run.

A few days ago, we did finally release the OLMo 2 tech report: arxiv.org/pdf/2501.00656. There is a lot of good stuff in there, but the stability work we did over the summer makes me particularly proud.

Everyone wants open-source language models but no one wants to lift these heavy ass weights. We just released our paper "2 OLMo 2 Furious". Can't stop us in 2025. Links below.

Some people seem to believe that LLMs give inoffensive, milquetoast answers because of overblown safety concerns ("Because of the woke!"). But that's not it. LLMs give bland answers because they produce the average of what anyone would have said on the Internet.

It seems to me the second most common language spoken in the halls of NeurIPS is German.

Made a list of resources for open source language models with @soldaini.net ahead of the tutorial tomorrow at 9:30 AM. github.com/allenai/awes...

Want to predict the task performance of LMs before pretraining them? We develop task scaling laws and model ladders, which predict the accuracy on individual tasks by OLMo 2 7B & 13B models within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
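A minimal sketch of the two-step "model ladder" idea as I understand it: fit a power law from model size and token count to task loss on a ladder of small runs, fit a sigmoid from task loss to accuracy, then chain the two to extrapolate to the target scale. The functional forms, constants, and synthetic ladder data below are illustrative, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import curve_fit

def task_loss(ND, A, alpha, B, beta, E):
    """Power law in model parameters N and training tokens D -> task loss."""
    N, D = ND
    return A / N**alpha + B / D**beta + E

def task_accuracy(L, lo, hi, k, L0):
    """Sigmoidal link from task loss -> task accuracy."""
    return lo + (hi - lo) / (1 + np.exp(k * (L - L0)))

# Synthetic "ladder" of small runs standing in for real measurements.
rng = np.random.default_rng(0)
N = np.repeat([190e6, 370e6, 760e6, 1.3e9], 3)   # ladder model sizes (parameters)
D = N * np.tile([5, 20, 80], 4)                  # three token budgets per size
loss = task_loss((N, D), 2e2, 0.27, 4e2, 0.30, 0.55) + rng.normal(0, 0.005, N.size)
acc = task_accuracy(loss, 0.25, 0.80, 4.0, 2.0) + rng.normal(0, 0.005, N.size)

# Step 1: scale -> task loss.   Step 2: task loss -> accuracy.
p_loss, _ = curve_fit(task_loss, (N, D), loss, p0=[1e2, 0.3, 1e2, 0.3, 0.5], maxfev=50_000)
p_acc, _ = curve_fit(task_accuracy, loss, acc, p0=[0.3, 0.8, 3.0, 2.0], maxfev=50_000)

# Chain the two fits to predict a target-scale model (e.g. ~7B params on ~4T tokens)
# without training it.
pred_loss = task_loss((7e9, 4e12), *p_loss)
print(f"predicted loss {pred_loss:.3f}, predicted accuracy {task_accuracy(pred_loss, *p_acc):.3f}")
```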

I'll be at NeurIPS from Wednesday until Sunday! Do you think about pre-training? GPUs? What makes a foundation model good? If you have questions or answers, let's find a time to chat!

We just updated the OLMo repo at github.com/allenai/OLMo! There are now several training configs that together reproduce the training runs that led to the final OLMo 2 models. In particular, all the training data is available, tokenized and shuffled exactly as we trained on it!