Whenever I see 3 significant digits in a benchmark, especially of a qualitative task, I immediately get suspicious. "Sonnet 4 improved our SWE-bench agent single pass score from 60.6% to 70.6%" So, 60 to 70?

What kind of errors does it still make? What new errors does it make?

Comments