LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge agreement.
Benchmarking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
citing papers explorer
-
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge agreement.
-
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
-
Better & Faster Large Language Models via Multi-token Prediction
Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
-
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
-
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.