We find that it rarely performs proper editing:
R1 often completely discards its initial drafts, starting from scratch over and over. Similarly, the final output emitted after the thinking process ends might not be faithful to the drafts it contains.
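To make the faithfulness claim measurable, here is a minimal sketch of how one could compare the last draft inside the reasoning trace against the final answer. The assumption that drafts appear as fenced blocks, and the `extract_drafts` / `faithfulness` helpers, are ours, not part of any existing tooling:

```python
import difflib
import re

FENCE = "`" * 3  # markdown-style code fence, built indirectly
DRAFT_RE = re.compile(FENCE + r"(?:\w*\n)?(.*?)" + FENCE, re.DOTALL)

def extract_drafts(reasoning: str) -> list[str]:
    """Heuristic (ours): treat fenced blocks inside the reasoning trace as
    candidate 'video' drafts; adjust for however the model formats them."""
    return DRAFT_RE.findall(reasoning)

def faithfulness(reasoning: str, final_output: str) -> float:
    """Similarity of the last draft to the final answer: near 1.0 means the
    answer reuses the draft, near 0.0 suggests the model started over."""
    drafts = extract_drafts(reasoning)
    if not drafts:
        return 0.0
    return difflib.SequenceMatcher(None, drafts[-1], final_output).ratio()
```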
A more complex task does not necessarily lead to longer reasoning chains.
We ask the model to generate single objects (e.g. a dog), and then also object compositions (e.g. a half-dog half-shark) as a more complex task.
However, this second task leads to slightly shorter chains!
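As a rough illustration of how one could measure this, here is a sketch using DeepSeek's OpenAI-compatible API, where `deepseek-reasoner` exposes the thinking trace as `reasoning_content`. The example prompts and the character-count proxy for chain length are our assumptions:

```python
from openai import OpenAI

# Assumes DeepSeek's OpenAI-compatible endpoint; `deepseek-reasoner`
# returns the thinking trace in `message.reasoning_content`.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def chain_length(prompt: str) -> int:
    """Length of the reasoning chain in characters (a crude proxy;
    token counts would be more precise)."""
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": prompt}],
    )
    return len(resp.choices[0].message.reasoning_content)

# Example prompts (ours): single object vs. composition.
print(chain_length("Generate a text 'video' of a dog."))
print(chain_length("Generate a text 'video' of a half-dog half-shark."))
```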
We find that R1 often "gets lost" in symbolic/math reasoning instead of generating actual drafts of the "video".
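One crude way to flag such runs is to score how much of the trace looks like symbolic manipulation rather than drawing. The pattern and threshold below are ad-hoc choices for illustration, not a validated classifier:

```python
import re

# Ad-hoc pattern (ours): LaTeX-like commands, digit-operator-digit
# arithmetic, or math-flavored keywords. Tune for your own traces.
MATH_PAT = re.compile(
    r"\\[a-zA-Z]+|\d\s*[=+*/-]\s*\d|\b(equation|coordinate|solve)\b",
    re.IGNORECASE,
)

def math_heavy(reasoning: str, threshold: float = 0.5) -> bool:
    """True if more than `threshold` of non-empty trace lines look like
    symbolic manipulation rather than drawing."""
    lines = [l for l in reasoning.splitlines() if l.strip()]
    if not lines:
        return False
    mathy = sum(bool(MATH_PAT.search(l)) for l in lines)
    return mathy / len(lines) > threshold
```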
Overall:
1. R1 slightly better than V3
2. Yet still many failure modes, especially a lack of iterative refinement