New preprint! ✨
Interested in LLM-as-a-Judge?
Want to get the best judge for ranking your system?
Our new work is just for you:
"JuStRank: Benchmarking LLM Judges for System Ranking"
🕺💃
https://arxiv.org/abs/2412.09569
Many works evaluate LLM judges, but most focus on the judge's ability to choose the better response
We focus on the judge's ability to choose the better system
For the LLMs, we tested 4 distinct judge realizations
➕ Reward models
They judged the responses of 64 systems,
giving us each judge's system ranking.
Then we compared these rankings to the Arena's gold ranking
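Here's a minimal sketch of that comparison step, assuming agreement with the gold ranking is scored with a rank correlation such as Kendall's tau (the metric choice and the scores below are illustrative, not taken from the paper):

```python
# Sketch: compare a judge-induced system ranking to the gold (Arena) ranking.
# Assumption: agreement is scored with Kendall's tau rank correlation.
from scipy.stats import kendalltau

# Hypothetical per-system quality scores (higher = better).
gold_scores = {"sys_a": 0.81, "sys_b": 0.74, "sys_c": 0.55}
judge_scores = {"sys_a": 0.79, "sys_b": 0.60, "sys_c": 0.62}

systems = sorted(gold_scores)
tau, p_value = kendalltau(
    [gold_scores[s] for s in systems],
    [judge_scores[s] for s in systems],
)
print(f"Kendall's tau vs. gold ranking: {tau:.3f} (p={p_value:.3g})")
```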
1⃣ Smaller dedicated judges can be on par with big ones
2⃣ An LLM judge's realization matters a lot
3⃣ Comparative judgment is not the best choice for most judges
🕺💃
For that, we turn to the system preference task:
Given a pair of systems, which one is better?
We plot the gold vs. judge-predicted win-rates
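A minimal sketch of how such a win-rate could be computed, assuming one judge preference per shared test instance (the data here is hypothetical):

```python
# Sketch: empirical win-rate of system A over system B from per-instance
# judge preferences. Assumption: one 'A'/'B' decision per instance, no ties.
def win_rate(preferences: list[str]) -> float:
    return sum(p == "A" for p in preferences) / len(preferences)

judge_prefs = ["A", "A", "B", "A", "B", "A"]  # hypothetical judge decisions
print(f"Judge win-rate of A over B: {win_rate(judge_prefs):.2f}")  # 0.67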
We call this behavior decisiveness!
Decisive judges prefer stronger systems even more than humans do!
We measure it based on the empirical fit between judge and gold win-rates
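As a sketch of one way to do that fit (the logistic-in-logit curve below is an assumption, not necessarily the paper's exact parametrization), fit a single steepness parameter mapping gold win-rates to judge win-rates; a steepness above 1 marks a decisive judge:

```python
# Sketch: quantify decisiveness by fitting a curve from gold win-rates to
# judge win-rates. Assumption: a logistic-in-logit form with steepness beta;
# beta > 1 means the judge amplifies gaps between systems.
import numpy as np
from scipy.optimize import curve_fit

def decisiveness_curve(x, beta):
    logit = np.log(x / (1 - x))
    return 1 / (1 + np.exp(-beta * logit))

# Hypothetical (gold, judge) win-rate pairs over system pairs.
gold_wr = np.array([0.2, 0.35, 0.5, 0.65, 0.8])
judge_wr = np.array([0.05, 0.2, 0.5, 0.85, 0.97])

(beta,), _ = curve_fit(decisiveness_curve, gold_wr, judge_wr, p0=[1.0])
print(f"Fitted decisiveness beta: {beta:.2f}")  # > 1 => decisive judge
```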
Another judge characteristic: system-specific bias,
where a judge consistently prefers or dislikes a specific system.
Our results reveal large biases that affect system ranking
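One rough way to sketch such a bias estimate (an illustrative heuristic, not the paper's exact model): average the gap between the judge's and the gold win-rates over all pairings of each system:

```python
# Sketch: per-system bias as the mean gap between judge and gold win-rates
# over all of a system's pairings. Assumption: this heuristic stands in for
# the paper's actual bias measure; all numbers are hypothetical.
import numpy as np

systems = ["sys_a", "sys_b", "sys_c"]
# Win-rate matrices: entry [i, j] = win-rate of systems[i] over systems[j].
gold = np.array([[0.5, 0.6, 0.7],
                 [0.4, 0.5, 0.6],
                 [0.3, 0.4, 0.5]])
judge = np.array([[0.5, 0.7, 0.9],
                  [0.3, 0.5, 0.4],
                  [0.1, 0.6, 0.5]])

for i, s in enumerate(systems):
    # Positive bias: the judge favors s beyond what gold win-rates predict.
    others = [j for j in range(len(systems)) if j != i]
    bias = float(np.mean(judge[i, others] - gold[i, others]))
    print(f"{s}: bias {bias:+.2f}")
```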