*Current* reinforcement learning methods may not be improving the reasoning capacity of LLMs. Instead, they may be training the models to find shortcuts more efficiently.
https://limit-of-rlvr.github.io