hub Canonical reference

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang · 2024 · cs.AI · arXiv 2410.12784

Canonical reference. 75% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 3

citation-polarity summary

background 6 use dataset 2

representative citing papers

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

TheraJudge, trained via preference optimization on human annotations, reaches high clinician agreement (ICC 0.87-0.95) and, when used by TheraAgent, raises human-rated therapeutic quality by 0.43 points on a 5-point scale with 94% recovery of low-quality responses.

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

cs.AI · 2026-05-31 · conditional · novelty 6.0

LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

LATTICE: Evaluating Decision Support Utility of Crypto Agents

cs.CR · 2026-04-29 · unverdicted · novelty 6.0

LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

cs.AI · 2026-04-04 · unverdicted · novelty 6.0

Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.

MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

cs.ET · 2026-03-10 · unverdicted · novelty 6.0

MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

cs.SE · 2026-04-30 · unverdicted · novelty 5.0

LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

cs.AI · 2026-04-24 · unverdicted · novelty 5.0

An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

cs.SE · 2026-04-18 · unverdicted · novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

Showing 2 of 2 citing papers after filters.

NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 184 · internal anchor
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 73 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

JudgeBench: A Benchmark for Evaluating LLM-based Judges

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer