AFAIK it's the same dataset; they just use the larger pretrained model as the teacher. The screenshot is from section 5 of the DINOv2 paper: https://arxiv.org/abs/2304.07193
Depending on your definition of recent, the No Language Left Behind (NLLB) translation project also primarily released a distilled version of a larger MoE model.
If I recall correctly, Llama 2 only did best-of-N (BoN) sampling with the 70B and used those samples for the smaller models, so implicitly there was distillation in the alignment phase. It makes sense economically, so they may have kept this for 3+?
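Something like this sketch is what I have in mind for the BoN step (the `big_model.generate` and `reward_model.score` calls are hypothetical placeholders, not Meta's actual pipeline):

```python
# Rough best-of-N (rejection sampling) sketch: the big model generates N
# candidate responses, a reward model picks the best one, and that sample
# becomes an SFT target that the smaller models are fine-tuned on --
# implicit distillation from the 70B teacher.

def best_of_n(prompt, big_model, reward_model, n=8):
    candidates = [big_model.generate(prompt) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

def build_alignment_sft_data(prompts, big_model, reward_model):
    # (prompt, best response) pairs reused to fine-tune the smaller models
    return [(p, best_of_n(p, big_model, reward_model)) for p in prompts]
```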
I remembered something about Llama, so I looked at 3, and they pre-train on the same data from scratch. Then there's data recycling from bigger models to smaller ones in post-training, so yes, but only sort of. I will go back and look at 2; it was probably more clear-cut there. Thanks!
This is great, thanks! Do you know if there is any detailed report on this? Do they use the same pre-training data and just change the targets from one-hot distributions (the gold token) to logits from the bigger model?
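By changing the targets I mean something like the standard knowledge-distillation loss, where the one-hot cross-entropy is mixed with a KL term against the teacher's softened logits. Purely illustrative; the temperature and weighting here are made up, not from any report:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_tokens,
                      temperature=2.0, alpha=0.5):
    """Usual KD recipe: mix hard-label CE with soft-target KL.

    student_logits, teacher_logits: (batch, vocab)
    gold_tokens: (batch,) gold-token (one-hot) targets
    """
    # Hard targets: ordinary cross-entropy against the gold token.
    ce = F.cross_entropy(student_logits, gold_tokens)

    # Soft targets: KL divergence to the teacher's tempered distribution.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1 - alpha) * kd
```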
Yeah, this would only be post-training, and only in cases where you do not have SFT data available and the exploration and validation process is expensive, e.g. human feedback. (Strawberry to Orion may have included some distillation to smaller models as well.)
So the main consideration is data economics, but they might be getting some benefits from the overall stronger data the bigger model gives them. That makes a lot of sense.
As well. The only argument against it is that it is a bit off-policy (maybe not ideal for things like DPO?), but the teacher being just a bigger version that sees the same data may keep them very close.
I have a sense that most uses of DPO rely on preference data from another model (or off the shelf), so it seems pretty standard. If this is just to get a reward model and then run RL, it remains on-policy.
[can we even talk about DPO as on-policy, beyond the first update? :) ]
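[For concreteness: the DPO objective only needs fixed (chosen, rejected) pairs plus a frozen reference model, so nothing in it requires the pairs to come from the policy being trained. A minimal sketch of the standard loss, variable names mine:]

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Summed log-probs of each chosen/rejected response under the policy
    # being trained and under the frozen reference model.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preference pairs are a static dataset: once collected (from a
    # bigger model, humans, or off the shelf), every gradient step reuses
    # them, which is why DPO is effectively off-policy after the start.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```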
https://github.com/openai/whisper/discussions/2363
https://huggingface.co/facebook/nllb-200-distilled-600M
https://huggingface.co/meta-llama/Llama-3.2-1B