ThreadSky

kentbeck.com • 7 hours ago

Whenever I see 3 significant digits in a benchmark, especially of a qualitative task, I immediately get suspicious. "Sonnet 4 improved our SWE-bench agent single pass score from 60.6% to 70.6%" So, 60 to 70?

What kind of errors does it still make? What new errors does it make?
