apparently RLCoT (chain of thought learned via RL) is itself an emergent behavior: it doesn't show up until models reach roughly 1.5B parameters

PPO, GRPO, PRIME — it doesn't matter which RL algorithm you use; the key is that it's RL
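One illustration of what these algorithms share: they only need a scalar reward per sampled completion. A minimal sketch of GRPO's group-relative advantage (the bit that replaces PPO's value network) — the function name and the Countdown-style 0/1 reward are illustrative assumptions, not the TinyZero code:

```python
# Hypothetical sketch of GRPO's group-relative advantage.
# For each prompt, sample a group of completions, score each one,
# and normalize rewards within the group: (r - mean) / std.
# No learned critic is needed, which is why it scales down so easily.
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled completions."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards -> zero advantages
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to a puzzle, reward 1.0 if the answer checks out else 0.0
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Correct completions get positive advantage, incorrect ones negative, and the policy gradient pushes the model toward whatever reasoning produced the correct ones.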

experiment logs: https://wandb.ai/jiayipan/TinyZero?nw=nwuserjiayipan

x: https://x.com/jiayi_pirate/status/1882839504899420517?s=46&t=ftkDjGBpGPr2-yTN2CCUYg
