pith. machine review for the scientific record.

arxiv: 2410.12784 · v2 · pith:WBRSQ7UHnew · submitted 2024-10-16 · 💻 cs.AI · cs.CL · cs.LG

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Pith reviewed 2026-05-18 01:19 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords LLM judges · evaluation benchmarks · objective correctness · response evaluation · reasoning tasks · AI reliability

The pith

JudgeBench shows that even top LLM judges like GPT-4o perform only slightly better than random guessing on response pairs labeled for objective factual and logical correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that existing benchmarks for LLM-based judges over-rely on human preference data, which becomes unreliable for hard tasks where crowdsourced opinions diverge from actual correctness. It introduces JudgeBench, a new evaluation set spanning knowledge, reasoning, math, and coding that is built by converting challenging existing datasets into pairs of responses with objective preference labels. Comprehensive tests of prompted, fine-tuned, multi-agent, and reward-model judges on this benchmark reveal markedly weaker performance than on prior suites, with strong models hovering near random levels. This matters because LLM judges are now widely used to rank, refine, and align other models; if they fail on objective correctness, the resulting training signals can be systematically flawed. The work therefore supplies a concrete, scalable platform for measuring and improving judge reliability on precisely the tasks that matter most for advanced systems.

Core claim

JudgeBench is constructed via a novel pipeline that turns existing difficult datasets into response pairs whose preference labels track objective factual and logical correctness rather than human votes, and evaluations on this benchmark demonstrate that many strong LLM judges perform only slightly above random guessing.

What carries the argument

The novel pipeline that converts existing difficult datasets into challenging response pairs carrying preference labels based on objective correctness.
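
The summary above does not spell out the pipeline's individual steps, so the following is only a minimal sketch of one way such a conversion could work, assuming each source question comes with a verifiable ground-truth answer: grade a set of candidate responses against the ground truth, then pair each correct response with each incorrect one. The names `Pair`, `build_pairs`, and the `is_correct` checker are hypothetical and are not taken from the JudgeBench code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    question: str
    response_a: str
    response_b: str
    label: str  # "A" or "B": which response is objectively correct


def build_pairs(question: str,
                responses: List[str],
                is_correct: Callable[[str], bool]) -> List[Pair]:
    """Pair every correct response with every incorrect one.

    `is_correct` is an assumed, dataset-specific checker (for example,
    exact match on a final answer or passing unit tests).
    """
    good = [r for r in responses if is_correct(r)]
    bad = [r for r in responses if not is_correct(r)]
    pairs = []
    for g, b in product(good, bad):
        # Emit both orderings so the label cannot be inferred from position.
        pairs.append(Pair(question, g, b, "A"))
        pairs.append(Pair(question, b, g, "B"))
    return pairs


if __name__ == "__main__":
    demo = build_pairs(
        "What is 2 + 2?",
        ["The answer is 4", "The answer is 5"],
        lambda r: r.endswith(" 4"),  # toy checker, for illustration only
    )
    for p in demo:
        print(p.label, "|", p.response_a, "|", p.response_b)
```

Under this sketch, a question whose candidate responses are all correct or all incorrect yields no pairs, which is one way objective labels can be kept unambiguous.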

If this is right

  • Current LLM judges lack sufficient robustness for reliable evaluation on complex knowledge and reasoning tasks.
  • Training or alignment pipelines that depend on these judges risk propagating errors when correctness matters more than preference.
  • Multi-agent and fine-tuned judge approaches still fall short on the objective-hard tasks represented in JudgeBench.
  • Reward models exhibit the same performance ceiling, indicating the limitation is not solved by scale or preference tuning alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Strong performance on JudgeBench could serve as a useful filter when selecting judges for downstream model improvement loops.
  • The benchmark could be extended to additional domains such as science or policy where objective correctness can be defined.
  • Developers may need to combine JudgeBench-style objective checks with human oversight for edge cases that remain ambiguous.
  • Repeated use of weak judges on hard tasks could compound factual drift in successive model generations.

Load-bearing premise

The pipeline successfully turns existing datasets into response pairs whose labels accurately reflect objective factual and logical correctness without introducing systematic labeling errors or biases.

What would settle it

Independent expert verification of a random sample of JudgeBench pairs that finds frequent mismatches between the pipeline labels and clear factual or logical truth would falsify the benchmark's core validity.
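
To make "frequent mismatches" measurable, an audit of this kind could report the observed mismatch rate together with a confidence interval. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the example counts are illustrative assumptions, not figures from the paper.

```python
import math


def wilson_interval(mismatches: int, n: int, z: float = 1.96):
    """Wilson score interval for a mismatch rate (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = mismatches / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical audit: experts disagree with 10 of 200 sampled labels.
    low, high = wilson_interval(mismatches=10, n=200)
    print(f"mismatch rate 5.0%, ~95% interval: {low:.3f} to {high:.3f}")
```

The Wilson interval is chosen here because it behaves sensibly when the mismatch count is zero or very small, which is the outcome a well-built benchmark would hope for.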

Original abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs drawn from knowledge, reasoning, math, and coding domains. It proposes a novel pipeline that converts existing difficult datasets into response pairs equipped with preference labels reflecting objective factual and logical correctness (rather than human preferences). Comprehensive experiments on prompted judges, fine-tuned judges, multi-agent systems, and reward models show that even strong models such as GPT-4o perform only marginally above random guessing, indicating that JudgeBench is substantially harder than prior benchmarks. Data and code are released.

Significance. If the conversion pipeline produces labels that faithfully capture objective correctness, the benchmark would fill an important gap: current judge evaluations rely heavily on human-preference alignment, which becomes unreliable for advanced models. The public release of data and code is a clear strength that supports reproducibility and follow-up work. The result that frontier models hover near chance on objective tasks would be a useful signal for the community if the labels are shown to be free of systematic artifacts.

major comments (2)
  1. [§3] §3 (Pipeline Construction): The central claim that JudgeBench is markedly harder than prior benchmarks rests on the pipeline producing preference labels that accurately encode objective correctness. The manuscript provides only a high-level description of the conversion process and does not report validation against ground-truth solutions from the source datasets, checks for multiple valid answers in math/coding items, or quantitative measures of label noise. Without these, the observed near-chance performance of GPT-4o and similar models could partly reflect label artifacts rather than judge limitations.
  2. [§5] §5 (Experimental Results): The headline numbers (e.g., GPT-4o only slightly above random) are presented without statistical significance tests, confidence intervals, or ablation on label quality. This makes it difficult to assess whether the reported difficulty is robust or sensitive to small changes in the labeling procedure.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction could more explicitly contrast JudgeBench with existing judge benchmarks (e.g., those based on human preference datasets) to clarify the precise novelty of the objective-correctness framing.
  2. [Figure 1] Figure 1 (pipeline diagram) would benefit from a more detailed caption that enumerates each transformation step and the exact criteria used to assign the final preference label.
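
The concern in major comment 1, that label artifacts could pull measured accuracy toward chance, can be illustrated with a small calculation. It is a sketch under a simplifying assumption that is not drawn from the paper: each label is flipped independently with probability `flip_rate`, regardless of the judge's verdict. In that case a judge with true accuracy `a` is scored as `a * (1 - p) + (1 - a) * p`. The function names are hypothetical.

```python
def observed_accuracy(true_accuracy: float, flip_rate: float) -> float:
    """Accuracy measured against labels flipped at random with rate flip_rate."""
    return true_accuracy * (1 - flip_rate) + (1 - true_accuracy) * flip_rate


def corrected_accuracy(observed: float, flip_rate: float) -> float:
    """Invert the relation above; only defined for flip rates below 0.5."""
    if not 0 <= flip_rate < 0.5:
        raise ValueError("flip_rate must be in [0, 0.5)")
    return (observed - flip_rate) / (1 - 2 * flip_rate)


if __name__ == "__main__":
    # A judge that is right 80% of the time, scored against 5% noisy labels.
    print(round(observed_accuracy(0.80, 0.05), 4))   # 0.77
    print(round(corrected_accuracy(0.77, 0.05), 4))  # 0.8
```

Random noise of this kind shrinks every score toward 0.5, so it can exaggerate how close a judge sits to chance but cannot by itself turn a strong judge into a chance-level one unless the flip rate is large. Systematic labeling errors would not follow this formula.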

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Pipeline Construction): The central claim that JudgeBench is markedly harder than prior benchmarks rests on the pipeline producing preference labels that accurately encode objective correctness. The manuscript provides only a high-level description of the conversion process and does not report validation against ground-truth solutions from the source datasets, checks for multiple valid answers in math/coding items, or quantitative measures of label noise. Without these, the observed near-chance performance of GPT-4o and similar models could partly reflect label artifacts rather than judge limitations.

    Authors: We agree that a more detailed validation of the pipeline would strengthen the central claim. In the revised manuscript, we will expand Section 3 with a dedicated subsection on label validation. This will include: (1) agreement rates between the generated preference labels and ground-truth solutions from the source datasets, computed over a randomly sampled subset of 200 examples per domain; (2) explicit discussion of multiple valid answers in math and coding, noting that our pipeline selects responses based on objective factual/logical correctness relative to the problem statement and discards ambiguous cases during construction; and (3) quantitative measures of label noise, such as the fraction of pairs where both responses are incorrect or where the preference label conflicts with source ground truth. These additions will provide direct evidence that the labels faithfully encode objective correctness. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The headline numbers (e.g., GPT-4o only slightly above random) are presented without statistical significance tests, confidence intervals, or ablation on label quality. This makes it difficult to assess whether the reported difficulty is robust or sensitive to small changes in the labeling procedure.

    Authors: We concur that statistical analysis and robustness checks are important for interpreting the headline results. In the revised Section 5, we will add bootstrap confidence intervals (1,000 resamples) around all accuracy figures and perform paired statistical tests (McNemar’s test) between models to establish significance of differences. We will also include an ablation on label quality: we will report judge performance on a high-confidence subset (where source ground-truth validation agrees with our labels) versus the full set, and on a version with 5% injected label flips. These analyses will demonstrate that the near-chance performance is robust to reasonable variations in labeling. revision: yes
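
The two analyses named in the second response can be sketched with the standard library alone. This is a minimal illustration of a percentile bootstrap interval and an exact two-sided McNemar test, not the authors' implementation; the function names, the fixed seed, and the toy inputs are assumptions. Each input is a list of 0/1 flags recording whether a judge picked the correct response on each pair.

```python
import math
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(correct: Sequence[int], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy over response pairs."""
    rng = random.Random(seed)
    n = len(correct)
    means: List[float] = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def mcnemar_exact(correct_a: Sequence[int], correct_b: Sequence[int]) -> float:
    """Exact two-sided McNemar p-value for two judges on the same pairs."""
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the judges never disagree on correctness
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    judge_a = [1] * 60 + [0] * 40  # toy data: 60% accuracy
    judge_b = [1] * 50 + [0] * 50  # toy data: 50% accuracy
    print(bootstrap_ci(judge_a))
    print(mcnemar_exact(judge_a, judge_b))
```

McNemar's test only looks at pairs where exactly one judge is correct, which is why it suits comparisons of two judges evaluated on the same items.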

Circularity Check

0 steps flagged

No significant circularity: benchmark labels derived independently of evaluated judges

Full rationale

The paper constructs JudgeBench by applying a novel pipeline to existing difficult datasets, producing response pairs whose preference labels are asserted to reflect objective factual and logical correctness. This construction step is not defined in terms of the LLM judges under evaluation, nor does it fit any parameters to the judges' outputs or performance metrics. The headline result—that strong models perform near random guessing—is a direct empirical measurement on the resulting benchmark rather than a quantity forced by self-definition, fitted inputs, or a self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claim to the paper's own inputs by construction. The derivation therefore remains self-contained against external datasets and does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the conversion pipeline yields preference labels that genuinely track objective correctness; this is a domain assumption rather than a fitted parameter or new entity.

axioms (1)
  • Domain assumption: Existing difficult datasets can be converted into response pairs whose preference labels accurately reflect objective factual and logical correctness.
    The pipeline described in the abstract presupposes that such conversion is feasible and reliable for knowledge, reasoning, math, and coding tasks.

pith-pipeline@v0.9.0 · 5782 in / 1255 out tokens · 44324 ms · 2026-05-18T01:19:07.682200+00:00 · methodology



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

    cs.CL 2026-05 unverdicted novelty 7.0

    DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

  2. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  3. Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

    cs.CL 2026-04 unverdicted novelty 7.0

    uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.

  4. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  5. CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

    cs.CL 2026-04 unverdicted novelty 7.0

    CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...

  6. RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.

  7. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

    cs.AI 2026-05 unverdicted novelty 6.0

    RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

  8. LATTICE: Evaluating Decision Support Utility of Crypto Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

  9. Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

    cs.AI 2026-04 unverdicted novelty 6.0

    Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.

  10. MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

    cs.ET 2026-03 unverdicted novelty 6.0

    MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.

  11. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  12. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

    cs.AI 2026-04 unverdicted novelty 5.0

    Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

  13. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  14. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  15. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  16. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

  17. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
