4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED: 4
citing papers explorer
- JudgeBench: A Benchmark for Evaluating LLM-based Judges
  JudgeBench converts difficult datasets into objective-correctness response pairs to evaluate LLM judges, showing even strong models like GPT-4o perform near random guessing.
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
  BenchBuilder automates extraction of 500 challenging prompts from crowd-sourced data into Arena-Hard-Auto, delivering 3x model separation versus MT-Bench and 98.6% human-preference correlation at $20 cost.
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
  Chatbot Arena is an open platform that ranks LLMs using crowdsourced pairwise human votes, with statistical methods and validation showing diverse questions and agreement with expert raters.
- Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
  DIR applies an information bottleneck to reward model training to mitigate complex inductive biases such as length, sycophancy, and format, with claimed improvements in RLHF generalization.