We released the OLMo 2 report! Ready for some more RL curves?
This time, we applied RLVR iteratively! Our initial RLVR checkpoint, trained on the full RLVR dataset mix, showed a low GSM8K score, so we ran another round of RLVR on GSM8K only, and then another on MATH only.
And it works! A thread 🧵 1/N
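For anyone who hasn't seen RLVR before, here's a minimal sketch of the idea behind the recipe, not our actual training code: a binary "verifiable" reward (exact match on the final answer) plus successive RL stages on narrower prompt sets. The `policy`, `update`, and prompt-set names below are placeholders.

```python
import re
from typing import Callable, Iterable

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the last number in the completion matches the
    gold answer, else 0.0 (a GSM8K/MATH-style exact-match check)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def rlvr_stage(policy: Callable[[str], str],
               prompts: Iterable[tuple[str, str]],
               update: Callable[[str, str, float], None]) -> None:
    """One RLVR stage: sample a completion per prompt, score it with the
    verifiable reward, and hand the triple to a policy-gradient update."""
    for prompt, gold in prompts:
        completion = policy(prompt)        # sample from the current policy
        reward = verifiable_reward(completion, gold)
        update(prompt, completion, reward)  # e.g. a PPO-style step

# Iterative RLVR, as described above: start from the checkpoint trained on the
# mixed prompt set, then keep training on GSM8K-only prompts, then MATH-only.
# for prompts in [mixed_prompts, gsm8k_prompts, math_prompts]:
#     rlvr_stage(policy, prompts, update)
```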
🤡 Basically, we didn't use HF's fast tokenizer, so the instruct models' tokenizer applies pre-tokenization logic differently from the base models'.
So, we decided to re-train the models using the correct tokenizer.
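To illustrate what that kind of mismatch can look like, here's a tiny sketch (not our actual setup) comparing Hugging Face's slow and fast tokenizers on the same string; gpt2 stands in for the real checkpoints because it ships both implementations.

```python
from transformers import AutoTokenizer

# gpt2 is used only as a stand-in; swap in the base/instruct checkpoints you
# actually want to compare.
slow_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)  # Python tokenizer
fast_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)   # Rust-backed tokenizer

text = "Janet has 16 eggs. 16 - 3 - 4 = 9."
print(slow_tok.tokenize(text))
print(fast_tok.tokenize(text))

# If the two token lists ever diverge (pre-tokenization is one place they can),
# a model fine-tuned with one tokenizer sees different token boundaries than
# the base model it started from.
```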
Our initial reproduction attempt showed regressions on SFT / DPO / RLVR.
Our final RLVR checkpoint does look pretty good.