Claude 3.7 is so temperamental at times. I gave it a Python codebase and asked it to review the test suite (~150 pytest tests), written by Gemini 2.5 Pro, which had some OK tests plus some pretty bad ones.
Claude's first answer: "Overall, this is a high-quality test suite that provides confidence in the library's functionality and reliability. The tests are well-implemented, provide good coverage, and verify meaningful behaviors."
Me: how about this [obviously totally shitty piece of test code]?
Claude's second answer: "[..] The tests provide a false sense of security by testing modified implementations, verifying calls rather than behavior, and over-using mocks that hide integration issues. [..] The fundamental approach needs rethinking to provide actual confidence in this codebase."
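For context, the anti-pattern Claude ended up calling out looks roughly like this (a hypothetical sketch, not code from the actual suite; `mylib`, `load_settings`, and `parse_toml` are made-up names): a test that mocks the dependency and only verifies a call happened, versus one that checks observable behavior.

```python
from unittest.mock import patch

from mylib import load_settings  # hypothetical function under test


# Anti-pattern: mock out the dependency and assert only that it was called.
# This passes even if load_settings mangles the parsed data.
def test_load_settings_calls_parser():
    with patch("mylib.parse_toml") as mock_parse:
        load_settings("settings.toml")
        mock_parse.assert_called_once_with("settings.toml")


# Behavior-based test: give it real input and assert on the actual result.
def test_load_settings_reads_port(tmp_path):
    cfg = tmp_path / "settings.toml"
    cfg.write_text("[server]\nport = 8080\n")
    settings = load_settings(str(cfg))
    assert settings["server"]["port"] == 8080
```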
This was probably the biggest and weirdest fail of Claude 3.7 for me so far, and I've been using it a lot. It's sometimes clearly "lazy", and after you point it out, it overreacts.
Gemini 2.5 Pro seems to be much more consistent, and I'm generally starting to like it more for coding tasks.
I also found this interesting because LLMs seem to like each other's output more than output they haven't seen before -- even if it's out of place given the context. It seems to be pretty hard to prompt around this.