hamishivi.bsky.social • 160 days ago

Excited to release Tulu 3! We worked hard to try and make the best open post-training recipe we could, and the results are good!
I was lucky enough to work on almost every stage of the pipeline in one way or another. Some comments + highlights ⬇️

Comments

hamishivi.bsky.social•160 days ago

but first quick links:
8B model: https://buff.ly/498x15q
70B model: https://buff.ly/3Ok4PTp
Demo: https://buff.ly/492H2Rw
Website: https://allenai.org/tulu

hamishivi.bsky.social•160 days ago

We generated and used a lot of new data for this release, and relied heavily on synthetically generated data in general.

Working out the best ways to generate synthetic data was crucial to really boosting performance.

hamishivi.bsky.social•160 days ago

We swapped from using DPO to length-normalized DPO! Actually, I tried a bunch of DPO-like losses (and a little PPO), but we found length-norm DPO to be particularly strong in the settings we tested.
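
For readers unfamiliar with the term, here is a minimal sketch of what a length-normalized DPO loss can look like for a single preference pair. It assumes the usual formulation in which each policy/reference log-probability ratio is divided by the response's token count before the DPO margin is computed. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not taken from the Tulu 3 code.

```python
import math


def length_normalized_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                               ref_chosen_logp, ref_rejected_logp,
                               chosen_len, rejected_len, beta=0.1):
    """Length-normalized DPO loss for one (chosen, rejected) pair.

    Each *_logp is the summed token log-probability of a full response.
    chosen_len / rejected_len are token counts. beta is a hypothetical default.
    """
    # Per-token log-ratio of policy vs. reference for each response.
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Dropping the two divisions by length recovers the standard DPO loss, so the only change is that long responses no longer dominate the margin simply by having more tokens.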

hamishivi.bsky.social•160 days ago

We also used on-policy data for DPO! This helps! Some further evidence that online training (for some definition of online) is useful.

hamishivi.bsky.social•160 days ago

We came up with a fun RL training strategy for the final stage: just do PPO against ground truth! We extract the answer, compare to the label, and reward if right.
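
A rough sketch of the "extract the answer, compare to the label, reward if right" step, assuming numeric answers as in GSM8K-style problems. The helper names, the "last number in the completion" extraction rule, and the reward value of 1.0 are all illustrative assumptions rather than the actual implementation.

```python
import re


def extract_final_answer(text):
    """Return the last number in the text as a string, or None if absent.

    Assumption: the final answer is the last number the model writes.
    Commas are stripped so "1,234" is read as 1234.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(completion, label, reward_value=1.0):
    """Binary reward: reward_value if the extracted answer matches the label."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        correct = abs(float(answer) - float(label)) < 1e-6
    except ValueError:
        # The label was not a plain number, so it cannot match.
        correct = False
    return reward_value if correct else 0.0
```

In a PPO loop this scalar would stand in for the score a learned reward model would otherwise provide for each sampled completion.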

hamishivi.bsky.social•160 days ago

When I first implemented this, we immediately saw gains of more than 10 points on evaluations like MATH and GSM8K (when applied to SFT models), without even trying to tune it.

The gains are smaller when your base models are already strong, but I am excited to take this further!
