A clever test of AI reasoning ability adds the option "none of these" to the common MMLU benchmark, forcing the AI to actually evaluate each option rather than just picking the best-looking one.

The result is a big drop in accuracy for most models, though Reasoners (o3 & DeepSeek) hold up much better. https://arxiv.org/pdf/2502.12896
