Clever test of AI reasoning ability adds the option "none of these" to the common MMLU benchmark, forcing the AI to actually evaluate the options rather than just picking the best-looking one.
The result is a big drop in accuracy for most models, though Reasoners (o3 & DeepSeek) hold up much better https://arxiv.org/pdf/2502.12896
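A rough sketch of what adding a "none of these" choice to a multiple-choice item could look like. This is not the paper's code: the `Question` type, the `with_none_option` helper, and the `drop_correct` switch (which removes the original answer so that "None of these" becomes the right choice) are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass

NONE_OPTION = "None of these"


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    options: list[str]
    answer: int  # index of the correct option


def with_none_option(q: Question, drop_correct: bool = False) -> Question:
    """Return a copy of q with a "None of these" option appended.

    Assumption: with drop_correct=True the original correct option is
    removed, so "None of these" becomes the correct answer.
    """
    if drop_correct:
        options = [o for i, o in enumerate(q.options) if i != q.answer]
        options.append(NONE_OPTION)
        return Question(q.stem, options, len(options) - 1)
    # Otherwise the original answer stays correct and keeps its index.
    return Question(q.stem, q.options + [NONE_OPTION], q.answer)


q = Question("2 + 2 = ?", ["3", "4", "5", "6"], answer=1)
print(with_none_option(q).options)
print(with_none_option(q, drop_correct=True).options)
```

With `drop_correct=True`, a model that only ranks the listed options by plausibility has nothing correct to land on, which is the kind of pressure the post describes.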