New preprint! ✨
Interested in LLM-as-a-Judge?
Want to get the best judge for ranking your system?
Our new work is just for you:
"JuStRank: Benchmarking LLM Judges for System Ranking"
🕺💃
https://arxiv.org/abs/2412.09569
Many works evaluate LLM judges, but most focus on the judge's ability to choose the better response
We focus on the judge's ability to choose the better system
For the LLMs, we tested 4 distinct judge realizations
➕ Reward models
They judged the responses of 64 systems,
giving us each judge's system ranking.
Then we compared these rankings to the Arena's gold ranking
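Here's a minimal sketch of that comparison step, assuming agreement with the gold ranking is scored with a rank correlation such as Kendall's tau (the metric choice and the scores below are illustrative, not taken from the paper):

```python
# Sketch: compare a judge-induced system ranking to the gold (Arena) ranking.
# Assumption: agreement is scored with Kendall's tau rank correlation.
from scipy.stats import kendalltau

# Hypothetical per-system quality scores (higher = better).
gold_scores = {"sys_a": 0.81, "sys_b": 0.74, "sys_c": 0.55}
judge_scores = {"sys_a": 0.79, "sys_b": 0.60, "sys_c": 0.62}

systems = sorted(gold_scores)
tau, p_value = kendalltau(
    [gold_scores[s] for s in systems],
    [judge_scores[s] for s in systems],
)
print(f"Kendall's tau vs. gold ranking: {tau:.3f} (p={p_value:.3g})")
```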
1⃣ Smaller dedicated judges can be on par with big ones
2⃣ An LLM judge's realization matters a lot
3⃣ Comparative judgment is not the best choice for most judges
🕺💃
For that, we turn to the system preference task:
Given a pair of systems, which one is better?
We plot the gold vs. judge-predicted win-rates
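A minimal sketch of how such a win-rate could be computed, assuming one judge preference per shared test instance (the data here is hypothetical):

```python
# Sketch: empirical win-rate of system A over system B from per-instance
# judge preferences. Assumption: one 'A'/'B' decision per instance, no ties.
def win_rate(preferences: list[str]) -> float:
    return sum(p == "A" for p in preferences) / len(preferences)

judge_prefs = ["A", "A", "B", "A", "B", "A"]  # hypothetical judge decisions
print(f"Judge win-rate of A over B: {win_rate(judge_prefs):.2f}")  # 0.67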
We call this behavior decisiveness!
Decisive judges prefer stronger systems even more than humans do!
We measure it based on the empirical fit between judge and gold win-rates
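As a sketch of one way to do that fit (the logistic-in-logit curve below is an assumption, not necessarily the paper's exact parametrization), fit a single steepness parameter mapping gold win-rates to judge win-rates; a steepness above 1 marks a decisive judge:

```python
# Sketch: quantify decisiveness by fitting a curve from gold win-rates to
# judge win-rates. Assumption: a logistic-in-logit form with steepness beta;
# beta > 1 means the judge amplifies gaps between systems.
import numpy as np
from scipy.optimize import curve_fit

def decisiveness_curve(x, beta):
    logit = np.log(x / (1 - x))
    return 1 / (1 + np.exp(-beta * logit))

# Hypothetical (gold, judge) win-rate pairs over system pairs.
gold_wr = np.array([0.2, 0.35, 0.5, 0.65, 0.8])
judge_wr = np.array([0.05, 0.2, 0.5, 0.85, 0.97])

(beta,), _ = curve_fit(decisiveness_curve, gold_wr, judge_wr, p0=[1.0])
print(f"Fitted decisiveness beta: {beta:.2f}")  # > 1 => decisive judge
```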
Another judge characteristic: system-specific bias,
where a judge consistently prefers or dislikes a specific system.
Our results reveal large biases that affect system ranking
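One rough way to sketch such a bias estimate (an illustrative heuristic, not the paper's exact model): average the gap between the judge's and the gold win-rates over all pairings of each system:

```python
# Sketch: per-system bias as the mean gap between judge and gold win-rates
# over all of a system's pairings. Assumption: this heuristic stands in for
# the paper's actual bias measure; all numbers are hypothetical.
import numpy as np

systems = ["sys_a", "sys_b", "sys_c"]
# Win-rate matrices: entry [i, j] = win-rate of systems[i] over systems[j].
gold = np.array([[0.5, 0.6, 0.7],
                 [0.4, 0.5, 0.6],
                 [0.3, 0.4, 0.5]])
judge = np.array([[0.5, 0.7, 0.9],
                  [0.3, 0.5, 0.4],
                  [0.1, 0.6, 0.5]])

for i, s in enumerate(systems):
    # Positive bias: the judge favors s beyond what gold win-rates predict.
    others = [j for j in range(len(systems)) if j != i]
    bias = float(np.mean(judge[i, others] - gold[i, others]))
    print(f"{s}: bias {bias:+.2f}")
```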