apparently RLCoT (chain of thought learned via RL) is itself an emergent behavior that doesn’t show up until models reach roughly 1.5B parameters
PPO, GRPO, PRIME — doesn’t matter what RL you use, the key is that it’s RL
experiment logs: https://wandb.ai/jiayipan/TinyZero?nw=nwuserjiayipan
x: https://x.com/jiayi_pirate/status/1882839504899420517?s=46&t=ftkDjGBpGPr2-yTN2CCUYg
Comments
Seems to me that it’s the prompting that’s driving this behavior rather than the RL?
in the R1 paper they talk about the “aha!” moment, that’s what they’re referring to
(sorry, can’t find it in the paper right now)
https://www.philschmid.de/deepseek-r1
if it’s easy, don’t do CoT. if it’s a Fields Medal problem, plz think for weeks kthxbye
The training data has examples of reasoning/CoT, but when the model produces reasoning on its own, RL boosts those pathways
Therefore the model learns its own reasoning methods
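Roughly, that boosting loop looks like this. A minimal GRPO-flavored sketch (group-relative advantage only; it drops the std normalization, clipping, and KL term, and uses a toy tabular “policy” in place of the LLM):

```python
import math
import random

# Toy "policy": logits over candidate answers to "2 + 2 = ?".
# In the real setup this is an LLM emitting a CoT + final answer.
logits = {"3": 0.0, "4": 0.0, "5": 0.0}

def rule_based_reward(answer: str) -> float:
    # Purely algorithmic check, no learned reward model.
    return 1.0 if answer == "4" else 0.0

def sample(logits: dict, k: int) -> list:
    answers = list(logits)
    weights = [math.exp(logits[a]) for a in answers]
    return random.choices(answers, weights=weights, k=k)

for step in range(200):
    group = sample(logits, k=8)                      # ensemble of answers
    rewards = [rule_based_reward(a) for a in group]
    baseline = sum(rewards) / len(rewards)
    # Group-relative advantage: boost answers that beat the group mean,
    # suppress the ones that fell below it.
    for answer, reward in zip(group, rewards):
        logits[answer] += 0.1 * (reward - baseline)

print(logits)  # "4" should end up with by far the highest logit
```

The point of the sketch: nothing here tells the model how to reason. Whatever it generated on its own that led to a correct answer is what gets reinforced.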
you always start with a fully pre-trained base model (V3 in this case) and run some further training process on top
what we’re finding here is that there’s something special about RL in particular, which is controversial bc it’s so dumb/simple
https://bsky.app/profile/timkellogg.me/post/3lgb7jatrks24
Interesting that the scoring of the ensemble of answers is just algorithmic (no learned reward model)
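TinyZero trains on a Countdown-style numbers task (as I understand it: combine the given numbers into an equation that hits a target), so the scorer really can be plain code. A sketch of what such a rule-based reward might look like (not the actual TinyZero reward function):

```python
import ast
import operator

# Allowed binary operators for the model's proposed equation.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    # Evaluate only numbers and basic arithmetic; reject anything else.
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def score(expression: str, target: float) -> float:
    # Rule-based reward: 1 if the proposed equation hits the target, else 0.
    # (Ignores the "use each given number exactly once" constraint for brevity.)
    try:
        value = safe_eval(ast.parse(expression, mode="eval"))
        return 1.0 if abs(value - target) < 1e-6 else 0.0
    except Exception:
        return 0.0

print(score("(6 - 2) * 25", 100))  # 1.0
print(score("6 * 2 * 25", 100))    # 0.0
```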