Clever test of AI reasoning ability: MMLU with a "none of these" option - ThreadSky

emollick.bsky.social • 8 days ago

A clever test of AI reasoning ability adds the option "none of these" to the common MMLU benchmark, forcing the AI to actually evaluate each option rather than just picking the best one.

The result is a big drop in accuracy for most models, though reasoning models (o3 and DeepSeek) hold up much better: https://arxiv.org/pdf/2502.12896
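To make the idea concrete, here is a minimal sketch of how a multiple-choice item could be turned into a "none of these" variant. The post does not spell out the paper's exact construction, so everything here is an assumption: the `MCQItem` type, the `add_none_option` and `accuracy` helpers, the `drop_correct` flag, and the exact option wording are all hypothetical. With `drop_correct=True` the original right answer is removed, so "None of these" becomes correct and a model can no longer succeed by picking the most plausible remaining option.

```python
from dataclasses import dataclass

# Hypothetical option wording; the post quotes "none of these".
NONE_OPTION = "None of these"


@dataclass
class MCQItem:
    question: str
    choices: list[str]
    answer: int  # index of the correct choice in `choices`


def add_none_option(item: MCQItem, drop_correct: bool) -> MCQItem:
    """Return a new item with a "None of these" choice appended.

    If drop_correct is True, the original correct choice is removed, so
    "None of these" becomes the right answer. Otherwise the original answer
    stays correct and "None of these" acts as an extra distractor.
    The input item is not modified.
    """
    if drop_correct:
        choices = [c for i, c in enumerate(item.choices) if i != item.answer]
        choices.append(NONE_OPTION)
        return MCQItem(item.question, choices, len(choices) - 1)
    return MCQItem(item.question, item.choices + [NONE_OPTION], item.answer)


def accuracy(predictions: list[int], items: list[MCQItem]) -> float:
    """Fraction of items where the predicted index matches the answer."""
    correct = sum(p == it.answer for p, it in zip(predictions, items))
    return correct / len(items)


if __name__ == "__main__":
    item = MCQItem("What is 2 + 2?", ["3", "4", "5", "22"], answer=1)
    variant = add_none_option(item, drop_correct=True)
    print(variant.choices, variant.answer)
```

Running it prints `['3', '5', '22', 'None of these'] 3`: the choice "4" is gone, so a model that simply picks the closest-looking remaining option gets the item wrong.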
