hamishivi.bsky.social
I (try to) do NLP research. Antipodean abroad. currently doing PhD @uwcse, prev @usyd @ai2 🇦🇺🇨🇦🇬🇧 ivison.id.au
61 posts 1,162 followers 373 following

Excited to be back home in Australia (Syd/Melb) for most of April! Email or DM if you want to grab a coffee :)

@vwxyzjn.bsky.social and @hamishivi.bsky.social have uploaded intermediate checkpoints for our recent RL models at Ai2. Folks should do research into how RL finetuning impacts the weights! Models with checkpoints: OLMo 2 7B, 13B, 32B Instruct; Tulu 3 and 3.1 8B; Tulu 3 405B
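If you want to poke at these, here's a minimal sketch of the kind of weight-diffing this enables, assuming the checkpoints are exposed as revisions on the Hugging Face Hub (the repo id and revision name below are placeholders; check the Ai2 model pages for the real ones):

```python
import torch
from transformers import AutoModelForCausalLM

repo = "allenai/OLMo-2-1124-7B-Instruct"  # placeholder repo id
before = AutoModelForCausalLM.from_pretrained(repo, revision="rl-step-0")  # hypothetical pre-RL revision
after = AutoModelForCausalLM.from_pretrained(repo)  # final RL checkpoint

with torch.no_grad():
    # Relative L2 change per parameter tensor, largest movers first.
    deltas = {
        name: ((p1 - p0).norm() / (p0.norm() + 1e-8)).item()
        for (name, p0), (_, p1) in zip(before.named_parameters(), after.named_parameters())
    }

for name, d in sorted(deltas.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}\t{d:.4f}")
```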

How well do data-selection methods work for instruction-tuning at scale? Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best! More below ⬇️ (1/8)
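Rough idea of RDS-style selection, as a hedged sketch: embed every candidate in the pool, score it against a few examples of the target task, and keep the top-k. The actual method uses the LM's own hidden states; a small sentence encoder and mean-pooling stand in here for brevity.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative encoder choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # [batch, tokens, dim]
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean-pool over real tokens
    return F.normalize(pooled, dim=-1)

pool = ["Explain photosynthesis.", "Write a haiku about rain.", "Solve 12 * 7."]  # big mixed pool
targets = ["What is 15% of 80?"]                   # few examples of the target task

scores = (embed(pool) @ embed(targets).T).max(dim=-1).values  # best match per candidate
selected = [pool[i] for i in scores.topk(k=2).indices]        # keep the top-k candidates
print(selected)
```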

(1/8) Excited to share some new work: TESS 2! TESS 2 is an instruction-tuned diffusion LM that performs close to its autoregressive (AR) counterparts on general QA tasks, trained by adapting an existing pretrained AR model. 📜 Paper: arxiv.org/abs/2502.13917 🤖 Demo: huggingface.co/spaces/hamis... More below ⬇️

GRPO makes everything better 😌

We took our most efficient model and made an open-source iOS app📱but why? As phones get faster, more AI will happen on device. With OLMoE, researchers, developers, and users can get a feel for this future: fully private LLMs, available anytime. Learn more from @soldaini.net👇 youtu.be/rEK_FZE5rqQ

li'l holiday project from the tulu team :) Scaling up the Tulu recipe to 405B works pretty well! We mainly see this as confirmation that open-instruct scales to large-scale training -- more exciting and ambitious things to come!

Seems like a good time to share this: a poster from a class project diving a little deeper into Tulu 3's RLVR. The DeepSeek R1 release today shows that scaling this sort of approach up can be very, very effective!

Excited to see Tulu 3 sits in between Llama 3.1 and 3.3 instruct on the chatbot arena leaderboard right now! Particularly happy it is top 20 for Math and Multi-turn prompts :) All the details and data on how to train a model this good are right here: arxiv.org/abs/2411.15124

We released the OLMo 2 report! Ready for some more RL curves? 😏 This time, we applied RLVR iteratively! Our initial RLVR checkpoint, trained on the full RLVR dataset mix, had a low GSM8K score, so we ran another round of RLVR on GSM8K only, then another on MATH only 😆. And it works! A thread 🧵 1/N

More OLMo! More performance! More details! We applied Tulu post-training to OLMo 2 as well, so you can get strong model performance AND see what your model was actually trained on.

UW News put out a Q&A about our recent work on Variational Preference Learning, a technique for personalizing Reinforcement Learning from Human Feedback (RLHF) washington.edu/news/2024/12...

Want to predict the task performance of LMs before pretraining them? We develop task scaling laws and model ladders, which predict the accuracy on individual tasks by OLMo 2 7B & 13B models within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
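A hedged sketch of the two-step ladder idea: fit task loss as a power law in compute over the small ladder models, then map loss to accuracy with a sigmoid. All numbers below are made-up placeholders, and the functional forms are illustrative, not the paper's exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up ladder measurements: compute (units of 1e19 FLOPs), task loss, accuracy.
compute = np.array([1.0, 3.0, 10.0, 30.0])
task_loss = np.array([1.30, 1.18, 1.07, 0.98])
task_acc = np.array([0.31, 0.38, 0.47, 0.55])

def power_law(c, a, alpha, e):      # step 1: task loss as a power law in compute
    return a * c ** (-alpha) + e

def loss_to_acc(l, lo, hi, k, l0):  # step 2: map loss to accuracy via a sigmoid
    return lo + (hi - lo) / (1.0 + np.exp(k * (l - l0)))

p1, _ = curve_fit(power_law, compute, task_loss, p0=[0.5, 0.5, 0.8])
p2, _ = curve_fit(loss_to_acc, task_loss, task_acc, p0=[0.25, 0.9, 8.0, 1.1], maxfev=20_000)

target_compute = 500.0              # e.g. a 7B-scale run, in the same units
predicted_loss = power_law(target_compute, *p1)
print("predicted accuracy:", loss_to_acc(predicted_loss, *p2))
```

The fits are done per task; the point is that the ladder of small models costs about 1% of the target model's pretraining compute.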

OpenAI's new RL finetuning API reminds me a lot of RLVR, which we used for Tülu 3 (arxiv.org/abs/2411.15124). Using RL to train against verifiable labels is a simple idea, but very effective (>10pt gains just using the GSM8K train set). It's implemented for you to use in Open-Instruct 😉: github.com/allenai/open...
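To show how simple the verifiable-rewards idea is, a toy sketch (mine, not the Open-Instruct implementation): extract the final answer and hand out a binary reward against the gold label. The "#### <answer>" pattern matches GSM8K-style solutions; other domains need their own extractors.

```python
import re

def extract_answer(completion: str) -> str | None:
    # GSM8K-style solutions end with "#### <number>".
    match = re.search(r"####\s*(-?[\d,.]+)", completion)
    return match.group(1).replace(",", "") if match else None

def verifiable_reward(completion: str, gold: str) -> float:
    # Binary reward: 1.0 iff the extracted answer matches the gold label.
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == gold.replace(",", "") else 0.0

assert verifiable_reward("Adding them up gives #### 1,234", "1234") == 1.0
assert verifiable_reward("I am not sure.", "42") == 0.0
```

In RLVR, a reward like this replaces the learned reward model during RL training.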

OpenAI announced a new RL finetuning API. You can do this on open models with the repo we used to train Tulu 3. Expanding reinforcement learning with verifiable rewards to more domains, with better answer extraction, is on our near-term roadmap. https://buff.ly/3V4JEIJ

Curious about all this inference-time scaling hype? Attend our NeurIPS tutorial: Beyond Decoding: Meta-Generation Algorithms for LLMs (Tue. 1:30)! We have a top-notch panelist lineup. Our website: cmu-l3.github.io/neurips2024-...

I’m on the academic job market this year! I’m completing my @uwcse.bsky.social @uwnlp.bsky.social Ph.D. (2025), focusing on overcoming LLM limitations like hallucinations, by building new LMs. My Ph.D. work focuses on Retrieval-Augmented LMs to create more reliable AI systems 🧵

We're hiring another predoctoral researcher for my team at Ai2/OLMo next year. The goal of this position is to mentor and grow future academic stars of NLP/AI over 1-2 years before grad school. In practice, that's usually someone finishing a BS or MS who wants to continue to a PhD soon. https://buff.ly/49nuggo

Excited to be at #NeurIPS next week in 🇨🇦! Please reach out if you want to chat about LM post-training (Tülu!), data curation, or anything else :) I'll be around all week, with two papers you should go check out (see image or next tweet):

I know it doesn't know much, if anything, about me, but this was surprisingly good!

Watching RL training curves is too addictive... begging my models to yap more and get more reward 🙏

🍲

What's that? A fully open LM competitive with Gemma and Qwen*? Happy to have helped a bit with this release (Tulu 3 recipe used here)! OLMo-2 13B actually beats Tulu 3 8B on these evals, making it a SOTA fully open LM!!! (*on the benchmarks we looked at, see tweet for more)

open source tulu 3 model recreation! rivals the original SFT and other models in its size range huggingface.co/allura-org/T...

Meet Tülu 3, a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms. We invented new methods for fine-tuning language models with RL and built upon best practices to scale synthetic instruction and preference data. Demo, GitHub, paper, and models 👇