JudgeBench: A Benchmark for Evaluating LLM-based Judges
Pith reviewed 2026-05-18 01:19 UTC · model grok-4.3
The pith
JudgeBench shows that even top LLM judges like GPT-4o perform only slightly better than random guessing on response pairs labeled for objective factual and logical correctness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JudgeBench is constructed via a novel pipeline that turns existing difficult datasets into response pairs whose preference labels track objective factual and logical correctness rather than human votes, and evaluations on this benchmark demonstrate that many strong LLM judges perform only slightly above random guessing.
What carries the argument
The novel pipeline that converts existing difficult datasets into challenging response pairs carrying preference labels based on objective correctness.
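A minimal sketch of what such a conversion could look like. The summary above does not spell out the steps, so this assumes that several candidate responses are collected per question and that each is checked against the source dataset's reference answer. The names `Candidate`, `Pair`, `build_pairs`, and `is_correct` are hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    question_id: str
    text: str          # full response shown to the judge
    final_answer: str  # extracted answer used for the correctness check


@dataclass(frozen=True)
class Pair:
    question_id: str
    response_a: str
    response_b: str
    label: str  # "A>B" or "B>A": which response is objectively correct


def build_pairs(
    candidates: List[Candidate],
    reference: Dict[str, str],
    is_correct: Callable[[str, str], bool],
) -> List[Pair]:
    """Pair each correct response with each incorrect one for the same question."""
    by_question: Dict[str, List[Candidate]] = {}
    for cand in candidates:
        by_question.setdefault(cand.question_id, []).append(cand)

    pairs: List[Pair] = []
    for qid, group in by_question.items():
        good = [c for c in group if is_correct(c.final_answer, reference[qid])]
        bad = [c for c in group if not is_correct(c.final_answer, reference[qid])]
        # Questions whose candidates are all correct or all incorrect yield no pair.
        for g, b in product(good, bad):
            # Assumption: emit both orderings so position alone cannot earn credit.
            pairs.append(Pair(qid, g.text, b.text, "A>B"))
            pairs.append(Pair(qid, b.text, g.text, "B>A"))
    return pairs
```

In this sketch the label comes from the reference answer alone, so no human vote and no judge output enters the labeling step.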
If this is right
- Current LLM judges lack sufficient robustness for reliable evaluation on complex knowledge and reasoning tasks.
- Training or alignment pipelines that depend on these judges risk propagating errors when correctness matters more than preference.
- Multi-agent and fine-tuned judge approaches still fall short on the objective-hard tasks represented in JudgeBench.
- Reward models exhibit the same performance ceiling, indicating the limitation is not solved by scale or preference tuning alone.
Where Pith is reading between the lines
- Strong performance on JudgeBench could serve as a useful filter when selecting judges for downstream model improvement loops.
- The benchmark could be extended to additional domains such as science or policy where objective correctness can be defined.
- Developers may need to combine JudgeBench-style objective checks with human oversight for edge cases that remain ambiguous.
- Repeated use of weak judges on hard tasks could compound factual drift in successive model generations.
Load-bearing premise
The pipeline successfully turns existing datasets into response pairs whose labels accurately reflect objective factual and logical correctness without introducing systematic labeling errors or biases.
What would settle it
Independent expert verification of a random sample of JudgeBench pairs that finds frequent mismatches between the pipeline labels and clear factual or logical truth would falsify the benchmark's core validity.
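The check described above can be made quantitative. The following is a rough sketch, assuming experts mark each sampled pair as either matching or contradicting the pipeline label; the function names are hypothetical. It draws a reproducible random sample and reports a Wilson score interval for the mismatch rate.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_for_audit(pair_ids: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct pair ids uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(pair_ids), k)


def wilson_interval(mismatches: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = mismatches / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

An audit that finds zero mismatches in 100 pairs still leaves an upper bound of a few percent on the true mismatch rate, which is why the sample size matters for this test.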
Original abstract
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs drawn from knowledge, reasoning, math, and coding domains. It proposes a novel pipeline that converts existing difficult datasets into response pairs equipped with preference labels reflecting objective factual and logical correctness (rather than human preferences). Comprehensive experiments on prompted judges, fine-tuned judges, multi-agent systems, and reward models show that even strong models such as GPT-4o perform only marginally above random guessing, indicating that JudgeBench is substantially harder than prior benchmarks. Data and code are released.
Significance. If the conversion pipeline produces labels that faithfully capture objective correctness, the benchmark would fill an important gap: current judge evaluations rely heavily on human-preference alignment, which becomes unreliable for advanced models. The public release of data and code is a clear strength that supports reproducibility and follow-up work. The result that frontier models hover near chance on objective tasks would be a useful signal for the community if the labels are shown to be free of systematic artifacts.
major comments (2)
- [§3] §3 (Pipeline Construction): The central claim that JudgeBench is markedly harder than prior benchmarks rests on the pipeline producing preference labels that accurately encode objective correctness. The manuscript provides only a high-level description of the conversion process and does not report validation against ground-truth solutions from the source datasets, checks for multiple valid answers in math/coding items, or quantitative measures of label noise. Without these, the observed near-chance performance of GPT-4o and similar models could partly reflect label artifacts rather than judge limitations.
- [§5] §5 (Experimental Results): The headline numbers (e.g., GPT-4o only slightly above random) are presented without statistical significance tests, confidence intervals, or ablation on label quality. This makes it difficult to assess whether the reported difficulty is robust or sensitive to small changes in the labeling procedure.
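The concern in the first major comment can be illustrated with a small calculation. This is a sketch under a simplifying assumption that is not from the paper: labels are flipped independently of the judge's own errors, at a symmetric rate `flip_rate`. Under that assumption a judge with true accuracy `a` is scored at `a * (1 - flip_rate) + (1 - a) * flip_rate`.

```python
def observed_accuracy(true_acc: float, flip_rate: float) -> float:
    """Accuracy measured against noisy labels, assuming independent symmetric flips."""
    return true_acc * (1 - flip_rate) + (1 - true_acc) * flip_rate


def corrected_accuracy(observed_acc: float, flip_rate: float) -> float:
    """Invert observed_accuracy; only defined for flip rates below 50%."""
    if not 0 <= flip_rate < 0.5:
        raise ValueError("flip_rate must be in [0, 0.5)")
    return (observed_acc - flip_rate) / (1 - 2 * flip_rate)
```

For instance, with 10% flipped labels a judge that is right 80% of the time would be scored at 74%. Noise of this kind pulls every score toward 50%, so a reported label-noise rate is needed to tell how much of the near-chance result it could explain.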
minor comments (2)
- [Abstract / §1] The abstract and introduction could more explicitly contrast JudgeBench with existing judge benchmarks (e.g., those based on human preference datasets) to clarify the precise novelty of the objective-correctness framing.
- [Figure 1] Figure 1 (pipeline diagram) would benefit from a more detailed caption that enumerates each transformation step and the exact criteria used to assign the final preference label.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (Pipeline Construction): The central claim that JudgeBench is markedly harder than prior benchmarks rests on the pipeline producing preference labels that accurately encode objective correctness. The manuscript provides only a high-level description of the conversion process and does not report validation against ground-truth solutions from the source datasets, checks for multiple valid answers in math/coding items, or quantitative measures of label noise. Without these, the observed near-chance performance of GPT-4o and similar models could partly reflect label artifacts rather than judge limitations.
Authors: We agree that a more detailed validation of the pipeline would strengthen the central claim. In the revised manuscript, we will expand Section 3 with a dedicated subsection on label validation. This will include: (1) agreement rates between the generated preference labels and ground-truth solutions from the source datasets, computed over a randomly sampled subset of 200 examples per domain; (2) explicit discussion of multiple valid answers in math and coding, noting that our pipeline selects responses based on objective factual/logical correctness relative to the problem statement and discards ambiguous cases during construction; and (3) quantitative measures of label noise, such as the fraction of pairs where both responses are incorrect or where the preference label conflicts with source ground truth. These additions will provide direct evidence that the labels faithfully encode objective correctness. revision: yes
-
Referee: [§5] §5 (Experimental Results): The headline numbers (e.g., GPT-4o only slightly above random) are presented without statistical significance tests, confidence intervals, or ablation on label quality. This makes it difficult to assess whether the reported difficulty is robust or sensitive to small changes in the labeling procedure.
Authors: We concur that statistical analysis and robustness checks are important for interpreting the headline results. In the revised Section 5, we will add bootstrap confidence intervals (1,000 resamples) around all accuracy figures and perform paired statistical tests (McNemar’s test) between models to establish significance of differences. We will also include an ablation on label quality: we will report judge performance on a high-confidence subset (where source ground-truth validation agrees with our labels) versus the full set, and on a version with 5% injected label flips. These analyses will demonstrate that the near-chance performance is robust to reasonable variations in labeling. revision: yes
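The two analyses promised in the second response could be sketched as follows. This is an illustrative standard-library version, not the authors' code: `bootstrap_ci` is a percentile bootstrap over per-pair correctness flags, and `mcnemar_exact` uses the exact binomial form of McNemar's test on the discordant pairs. Both function names are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    correct: Sequence[bool],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy over per-pair outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)  # sample pairs with replacement
        accs.append(sum(resample) / n)
    accs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return accs[lo_idx], accs[hi_idx]


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two judges scored on the same pairs."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the judges never disagree in correctness
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If the interval returned by `bootstrap_ci` for a judge contains 0.5, that judge's accuracy cannot be distinguished from random guessing at the chosen level. McNemar's test only uses pairs where exactly one judge is correct, which suits comparisons made on a shared benchmark.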
Circularity Check
No significant circularity: benchmark labels derived independently of evaluated judges
full rationale
The paper constructs JudgeBench by applying a novel pipeline to existing difficult datasets, producing response pairs whose preference labels are asserted to reflect objective factual and logical correctness. This construction step is not defined in terms of the LLM judges under evaluation, nor does it fit any parameters to the judges' outputs or performance metrics. The headline result—that strong models perform near random guessing—is a direct empirical measurement on the resulting benchmark rather than a quantity forced by self-definition, fitted inputs, or a self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claim to the paper's own inputs by construction. The derivation therefore remains self-contained against external datasets and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existing difficult datasets can be converted into response pairs whose preference labels accurately reflect objective factual and logical correctness.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
-
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...
-
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.
-
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.
-
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.
-
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
-
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
-
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
-
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.