Your human and LLM judges should follow the same criteria.
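One way to enforce this is to define the rubric once and render it both as the annotator guideline and as the LLM judge's prompt. A minimal sketch; the criteria, field names, and helpers here are hypothetical, not from any specific library:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str

# Hypothetical rubric -- the single source of truth for both judges.
RUBRIC = [
    Criterion("faithfulness", "The answer only states facts supported by the source."),
    Criterion("completeness", "The answer addresses every part of the question."),
]

def annotator_guideline(rubric: list[Criterion]) -> str:
    """Render the rubric as instructions for human annotators."""
    return "\n".join(f"- {c.name}: {c.description}" for c in rubric)

def judge_prompt(rubric: list[Criterion], answer: str) -> str:
    """Render the *same* rubric as the LLM judge's prompt."""
    return (
        "Score the answer as pass/fail on each criterion:\n"
        + annotator_guideline(rubric)
        + f"\n\nAnswer:\n{answer}"
    )
```

Because both judges read from the same rubric object, a criteria change propagates to humans and the LLM at once, which is what makes their agreement meaningful to measure.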
Then, once you have sufficient inter-annotator agreement between the LLM and your human annotators, you can transition from manual to automated evaluation. You now iterate faster, and the human annotator can focus on finding edge cases!
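A quick way to check that agreement before flipping the switch: score the same examples with both judges and compute Cohen's kappa. A sketch assuming both judges emit binary pass/fail labels; the labels and the 0.8 threshold are illustrative, while `cohen_kappa_score` is a real scikit-learn function:

```python
from sklearn.metrics import cohen_kappa_score

# Pass/fail labels from each judge on the same set of examples (toy data).
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm_labels   = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Cohen's kappa corrects for agreement that would happen by chance;
# values above ~0.8 are commonly read as near-perfect agreement.
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")

# One possible gate: only hand evaluation off to the LLM judge
# once chance-corrected agreement clears the threshold.
if kappa >= 0.8:
    print("Agreement is high -- safe to lean on the LLM judge.")
```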
Comments
We're building LLM and human "scorers" at @weightsbiases.bsky.social with the same data model for exactly this reason.