AFAIK it's the same dataset; they just use the larger pretrained model as the teacher. The screenshot is from section 5 of the DINOv2 paper: https://arxiv.org/abs/2304.07193
Depending on your definition of recent, the No Language Left Behind (NLLB) translation project also primarily released a distilled version of a larger MoE model.
If I recall correctly, Llama 2 only did best-of-N (BoN) sampling with the 70B and used those samples for the smaller models, so implicitly there was distillation in the alignment phase. It makes sense economically, so they may have kept this for 3+?
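Something like this sketch is what I have in mind for the BoN step (the `big_model.generate` and `reward_model.score` calls are hypothetical placeholders, not Meta's actual pipeline):

```python
# Rough best-of-N (rejection sampling) sketch: the big model generates N
# candidate responses, a reward model picks the best one, and that sample
# becomes an SFT target that the smaller models are fine-tuned on --
# implicit distillation from the 70B teacher.

def best_of_n(prompt, big_model, reward_model, n=8):
    candidates = [big_model.generate(prompt) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

def build_alignment_sft_data(prompts, big_model, reward_model):
    # (prompt, best response) pairs reused to fine-tune the smaller models
    return [(p, best_of_n(p, big_model, reward_model)) for p in prompts]
```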
I remembered something about Llama, so I looked at 3, and they pre-train on the same data from scratch. Then there's data recycling from bigger models to smaller ones in post-training, so yes, but only sort of. I will go back and look at 2; it was probably more clear-cut there. Thanks!
This is great, thanks! Do you know if there is any detailed report on this? Do they use the same pre-training data and just change the targets from one-hot distributions (the gold token) to logits from the bigger model?
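By changing the targets I mean something like the standard knowledge-distillation loss, where the one-hot cross-entropy is mixed with a KL term against the teacher's softened logits. Purely illustrative; the temperature and weighting here are made up, not from any report:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_tokens,
                      temperature=2.0, alpha=0.5):
    """Usual KD recipe: mix hard-label CE with soft-target KL.

    student_logits, teacher_logits: (batch, vocab)
    gold_tokens: (batch,) gold-token (one-hot) targets
    """
    # Hard targets: ordinary cross-entropy against the gold token.
    ce = F.cross_entropy(student_logits, gold_tokens)

    # Soft targets: KL divergence to the teacher's tempered distribution.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1 - alpha) * kd
```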
Yeah, this would only be post-training, and only in cases where you do not have SFT data available and the exploration and validation process is expensive, e.g. human feedback. (Strawberry to Orion may have included some distillation to smaller models as well.)
So the main consideration is data economics, but they might be getting some benefits from the overall stronger data the bigger model gives them. That makes a lot of sense.
As well. The only argument against it is that it is a bit off-policy (maybe not ideal for things like DPO?), but the teacher being just a bigger version that sees the same data may keep them very close.
I have a sense that most uses of DPO rely on preference data from another model (or off the shelf), so it seems pretty standard. If this is just to get a reward model and then run RL, it remains on-policy.
[can we even talk about DPO as on-policy, beyond the first update? :) ]
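[For concreteness: the DPO objective only needs fixed (chosen, rejected) pairs plus a frozen reference model, so nothing in it requires the pairs to come from the policy being trained. A minimal sketch of the standard loss, variable names mine:]

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Summed log-probs of each chosen/rejected response under the policy
    # being trained and under the frozen reference model.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preference pairs are a static dataset: once collected (from a
    # bigger model, humans, or off the shelf), every gradient step reuses
    # them, which is why DPO is effectively off-policy after the start.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```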
https://github.com/openai/whisper/discussions/2363
https://huggingface.co/facebook/nllb-200-distilled-600M
https://huggingface.co/meta-llama/Llama-3.2-1B