apparently RLCoT (chain of thought learned via RL) is itself an emergent behavior that doesn’t show up until models reach roughly 1.5B parameters
PPO, GRPO, PRIME — doesn’t matter what RL you use, the key is that it’s RL
experiment logs: https://wandb.ai/jiayipan/TinyZero?nw=nwuserjiayipan
x: https://x.com/jiayi_pirate/status/1882839504899420517?s=46&t=ftkDjGBpGPr2-yTN2CCUYg
Comments
Seems to me that it’s the prompting that’s driving this behavior rather than the RL?
in the R1 paper they talk about the “aha!” moment, that’s what they’re referring to
(sorry, can’t find it in the paper right now)
https://www.philschmid.de/deepseek-r1
if it’s easy, don’t do CoT. if it’s a Fields Medal problem, plz think for weeks kthxbye
The training data has examples of reasoning/CoT, but when the model produces reasoning on its own, RL boosts those pathways
Therefore the model learns its own reasoning methods
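Roughly, that boosting loop looks like this. A minimal GRPO-flavored sketch (group-relative advantage only; it drops the std normalization, clipping, and KL term, and uses a toy tabular “policy” in place of the LLM):

```python
import math
import random

# Toy "policy": logits over candidate answers to "2 + 2 = ?".
# In the real setup this is an LLM emitting a CoT + final answer.
logits = {"3": 0.0, "4": 0.0, "5": 0.0}

def rule_based_reward(answer: str) -> float:
    # Purely algorithmic check, no learned reward model.
    return 1.0 if answer == "4" else 0.0

def sample(logits: dict, k: int) -> list:
    answers = list(logits)
    weights = [math.exp(logits[a]) for a in answers]
    return random.choices(answers, weights=weights, k=k)

for step in range(200):
    group = sample(logits, k=8)                      # ensemble of answers
    rewards = [rule_based_reward(a) for a in group]
    baseline = sum(rewards) / len(rewards)
    # Group-relative advantage: boost answers that beat the group mean,
    # suppress the ones that fell below it.
    for answer, reward in zip(group, rewards):
        logits[answer] += 0.1 * (reward - baseline)

print(logits)  # "4" should end up with by far the highest logit
```

The point of the sketch: nothing here tells the model how to reason. Whatever it generated on its own that led to a correct answer is what gets reinforced.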
you always start with a fully pre-trained base model (V3 in this case) and run some further training process on top
what we’re finding here is that there’s something special about RL in particular, which is controversial bc it’s so dumb/simple
https://bsky.app/profile/timkellogg.me/post/3lgb7jatrks24
Interesting that the scoring of the ensemble of answers is just algorithmic (no learned reward model)
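TinyZero trains on a Countdown-style numbers task (as I understand it: combine the given numbers into an equation that hits a target), so the scorer really can be plain code. A sketch of what such a rule-based reward might look like (not the actual TinyZero reward function):

```python
import ast
import operator

# Allowed binary operators for the model's proposed equation.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    # Evaluate only numbers and basic arithmetic; reject anything else.
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def score(expression: str, target: float) -> float:
    # Rule-based reward: 1 if the proposed equation hits the target, else 0.
    # (Ignores the "use each given number exactly once" constraint for brevity.)
    try:
        value = safe_eval(ast.parse(expression, mode="eval"))
        return 1.0 if abs(value - target) < 1e-6 else 0.0
    except Exception:
        return 0.0

print(score("(6 - 2) * 25", 100))  # 1.0
print(score("6 * 2 * 25", 100))    # 0.0
```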