*Current* reinforcement learning methods may not be improving the reasoning capacity of LLMs. Instead, they may be training the models to find shortcuts more efficiently.

https://limit-of-rlvr.github.io
