Is anyone well-read in the DS-R1 tea leaves and feels confident they know what distillation method was used? It's not clear to me whether they mean "train on data from another model" or something that I'd consider "actually distilling".
My current guess is synthetic CoT?
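To make the two readings concrete, here is a rough JAX sketch of the difference as I understand it. The function names (`sft_loss`, `logit_distill_loss`) and the `temperature` parameter are my own illustrative assumptions, not anything taken from the DS-R1 release: the first trains on tokens sampled from the teacher (hard labels), the second matches the teacher's full output distribution (soft labels).

```python
import jax
import jax.numpy as jnp


def sft_loss(student_logits, teacher_tokens):
    """'Train on data from another model': cross-entropy on sampled teacher tokens.

    student_logits: [seq, vocab], teacher_tokens: [seq] (integer ids).
    """
    log_probs = jax.nn.log_softmax(student_logits, axis=-1)  # [seq, vocab]
    picked = jnp.take_along_axis(
        log_probs, teacher_tokens[:, None], axis=-1
    )  # [seq, 1]
    return -picked.mean()


def logit_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """'Actually distilling': KL(teacher || student) over the whole vocabulary.

    Both inputs: [seq, vocab]. `temperature` is a hypothetical knob.
    """
    t_logp = jax.nn.log_softmax(teacher_logits / temperature, axis=-1)
    s_logp = jax.nn.log_softmax(student_logits / temperature, axis=-1)
    kl = jnp.sum(jnp.exp(t_logp) * (t_logp - s_logp), axis=-1)  # [seq]
    return kl.mean()
```

The practical difference is that the first only needs text from the teacher, while the second needs access to the teacher's logits and a shared vocabulary.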
Comments
For example, when generating JAX code it will explicitly add shape annotations during "thinking". I don't think that's a hard prior to DPO in.
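By "shape annotations" I mean something like the sketch below. This is a hand-written illustration of the style, not actual model output, and `attention_scores` is a made-up example function.

```python
import jax
import jax.numpy as jnp


def attention_scores(q, k):
    # q: [batch, seq_q, dim], k: [batch, seq_k, dim]
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("bqd,bkd->bqk", q, k) * scale  # [batch, seq_q, seq_k]
    return jax.nn.softmax(scores, axis=-1)  # [batch, seq_q, seq_k]
```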