Claude 3.7 is so temperamental at times. I gave it a Python codebase and asked it to review the test suite (~150 pytest tests), written by Gemini 2.5 Pro, which had some OK tests plus some pretty bad ones.
Claude's first answer: "Overall, this is a high-quality test suite that provides confidence in the library's functionality and reliability. The tests are well-implemented, provide good coverage, and verify meaningful behaviors."
Me: how about this [obviously totally shitty piece of test code]?
Claude's second answer: "[..] The tests provide a false sense of security by testing modified implementations, verifying calls rather than behavior, and over-using mocks that hide integration issues. [..] The fundamental approach needs rethinking to provide actual confidence in this codebase."
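For context, the anti-pattern Claude ended up calling out looks roughly like this (a hypothetical sketch, not code from the actual suite; `mylib`, `load_settings`, and `parse_toml` are made-up names): a test that mocks the dependency and only verifies a call happened, versus one that checks observable behavior.

```python
from unittest.mock import patch

from mylib import load_settings  # hypothetical function under test


# Anti-pattern: mock out the dependency and assert only that it was called.
# This passes even if load_settings mangles the parsed data.
def test_load_settings_calls_parser():
    with patch("mylib.parse_toml") as mock_parse:
        load_settings("settings.toml")
        mock_parse.assert_called_once_with("settings.toml")


# Behavior-based test: give it real input and assert on the actual result.
def test_load_settings_reads_port(tmp_path):
    cfg = tmp_path / "settings.toml"
    cfg.write_text("[server]\nport = 8080\n")
    settings = load_settings(str(cfg))
    assert settings["server"]["port"] == 8080
```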
This was probably the biggest and weirdest fail of Claude 3.7 for me so far, and I've been using it a lot. It's sometimes clearly "lazy", and after you point it out, it overreacts.
Gemini 2.5 Pro seems to be much more consistent, and I'm generally starting to like it more for coding tasks.
I also found this interesting because LLMs seem to like each other's output more than output they haven't seen before -- even if it's out of place given the context. It seems to be pretty hard to prompt around this.