The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.
hub Canonical reference
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
Nine LLM judges on three NLI datasets with human labels provide only ~2 effective independent votes due to correlated errors, underperforming independent voting by 8-22 points and matched or beaten by the best single judge.
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.
govllm turns compliance into a runtime metric by routing via accumulated scores from a jury of regulatory LLM judges, with disagreement treated as a human-arbitration signal, validated on 49 annotated pairs showing 51.5-69.1% agreement across four SLMs.
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
citing papers explorer
-
Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence
The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.
-
Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation
Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.
-
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
-
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
-
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
-
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
-
Do AI Coding Agents Log Like Humans? An Empirical Study
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
-
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
-
A Finite-Calibration Regime Map for LLM Judge Panels
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
-
Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring
govllm turns compliance into a runtime metric by routing via accumulated scores from a jury of regulatory LLM judges, with disagreement treated as a human-arbitration signal, validated on 49 annotated pairs showing 51.5-69.1% agreement across four SLMs.
-
Reinforcing Human Behavior Simulation via Verbal Feedback
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
-
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
-
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
-
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
-
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
-
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
-
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
-
Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests
Consolidates eight corpora into a 6,671-prompt bank with five-judge consensus labels separating executable malicious code requests (4,748) from harmful security knowledge requests (1,923), achieving Fleiss' kappa 0.767.
-
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
-
Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.
-
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework
An evaluation-driven framework for customer support AI agents at Nubank integrates context engineering, LLM judges, and A/B testing to deliver up to 37pp NPS gains and strong offline-online correlation across five production domains.
-
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.
-
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.
-
OpenCoderRank: Personalized Technical Assessments with Generative AI
OpenCoderRank provides a self-hosted, customizable system for time-bound technical assessments with automatic grading via BERTScore and LLM evaluation.