hub Canonical reference

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky · 2024 · cs.CL · arXiv 2404.18796

Canonical reference. 88% of citing Pith papers cite this work as background.

38 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 38 citing papers arXiv PDF

abstract

As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 1

citation-polarity summary

background 7 use method 1

representative citing papers

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

cs.IR · 2026-04-26 · accept · novelty 7.0

Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

cs.AI · 2026-04-11 · conditional · novelty 7.0

TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

Do AI Coding Agents Log Like Humans? An Empirical Study

cs.SE · 2026-04-10 · unverdicted · novelty 7.0

AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

cs.CL · 2026-03-20 · conditional · novelty 7.0

Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.

A Finite-Calibration Regime Map for LLM Judge Panels

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

cs.CL · 2026-05-28 · conditional · novelty 6.0

Nine LLM judges on three NLI datasets with human labels provide only ~2 effective independent votes due to correlated errors, underperforming independent voting by 8-22 points and matched or beaten by the best single judge.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

cs.CL · 2026-05-23 · unverdicted · novelty 6.0

govllm turns compliance into a runtime metric by routing via accumulated scores from a jury of regulatory LLM judges, with disagreement treated as a human-arbitration signal, validated on 49 annotated pairs showing 51.5-69.1% agreement across four SLMs.

Reinforcing Human Behavior Simulation via Verbal Feedback

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

cs.HC · 2026-05-08 · unverdicted · novelty 6.0

A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

FUSE: Ensembling Verifiers with Zero Labeled Data

stat.ML · 2026-04-20 · unverdicted · novelty 6.0

FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

cs.AI · 2026-01-19 · unverdicted · novelty 6.0

CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

cs.AI · 2026-05-29 · unverdicted · novelty 5.0

Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence cs.SE · 2026-06-05 · unverdicted · none · ref 12 · internal anchor
The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.
Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation cs.LG · 2026-06-02 · unverdicted · none · ref 15 · internal anchor
Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization cs.CL · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm cs.CL · 2026-05-27 · unverdicted · none · ref 118 · internal anchor
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence cs.CL · 2026-05-09 · unverdicted · none · ref 56 · internal anchor
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice cs.LG · 2026-05-02 · unverdicted · none · ref 33 · internal anchor
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
Do AI Coding Agents Log Like Humans? An Empirical Study cs.SE · 2026-04-10 · unverdicted · none · ref 30 · internal anchor
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
A Finite-Calibration Regime Map for LLM Judge Panels cs.CL · 2026-05-31 · unverdicted · none · ref 20 · internal anchor
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring cs.CL · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
govllm turns compliance into a runtime metric by routing via accumulated scores from a jury of regulatory LLM judges, with disagreement treated as a human-arbitration signal, validated on 49 annotated pairs showing 51.5-69.1% agreement across four SLMs.
Reinforcing Human Behavior Simulation via Verbal Feedback cs.LG · 2026-05-19 · unverdicted · none · ref 151 · internal anchor
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios cs.HC · 2026-05-08 · unverdicted · none · ref 56 · internal anchor
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric cs.AI · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
FUSE: Ensembling Verifiers with Zero Labeled Data stat.ML · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning cs.AI · 2026-01-19 · unverdicted · none · ref 46 · internal anchor
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability cs.AI · 2026-05-29 · unverdicted · none · ref 33 · internal anchor
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests cs.CR · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
Consolidates eight corpora into a 6,671-prompt bank with five-judge consensus labels separating executable malicious code requests (4,748) from harmful security knowledge requests (1,923), achieving Fleiss' kappa 0.767.
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants cs.CL · 2026-05-10 · unverdicted · none · ref 95 · internal anchor
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization cs.CL · 2026-04-22 · unverdicted · none · ref 18 · internal anchor
Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process cs.CL · 2025-12-29 · unverdicted · none · ref 47 · internal anchor
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems cs.AI · 2025-08-10 · unverdicted · none · ref 99 · internal anchor
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework cs.CL · 2026-06-07 · unverdicted · none · ref 43 · internal anchor
An evaluation-driven framework for customer support AI agents at Nubank integrates context engineering, LLM judges, and A/B testing to deliver up to 37pp NPS gains and strong offline-online correlation across five production domains.
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content cs.LG · 2026-05-28 · unverdicted · none · ref 36 · internal anchor
Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack cs.AI · 2026-05-11 · unverdicted · none · ref 27 · internal anchor
Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.
OpenCoderRank: Personalized Technical Assessments with Generative AI cs.SE · 2025-09-08 · unverdicted · none · ref 13 · internal anchor
OpenCoderRank provides a self-hosted, customizable system for time-bound technical assessments with automatic grading via BERTScore and LLM evaluation.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer