I mean, I totally understand they couldn't test it on the newest o1, but not being able to test it on 4o really makes the whole study "useless".
GPT4 has pretty much no reasoning, while 4o has the beginnings of reasoning, and o1 has good reasoning.