Benchmarking Cognitive Biases in Large Language Models as Evaluators

Dongyeop Kang; Jong Inn Park; Minhwa Lee; Ryan Koo; Vipul Raheja; Zae Myung Kim

arxiv: 2309.17012 · v3 · pith:3SMX3ST6new · submitted 2023-09-29 · 💻 cs.CL · cs.AI· cs.LG

Benchmarking Cognitive Biases in Large Language Models as Evaluators

Ryan Koo , Minhwa Lee , Vipul Raheja , Jong Inn Park , Zae Myung Kim , Dongyeop Kang This is my paper

classification 💻 cs.CL cs.AIcs.LG

keywords evaluatorsllmsmodelsbenchmarkbiascognitivelanguagelarge

0 comments

read the original abstract

Large Language Models are cognitively biased judges. Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
cs.AI 2026-05 unverdicted novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
cs.CL 2026-05 unverdicted novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
cs.AI 2026-05 unverdicted novelty 7.0

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
cs.AI 2026-04 unverdicted novelty 7.0

LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge ...
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?
cs.CL 2026-06 unverdicted novelty 6.0

Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.
Show, Don't TELL: Explainable AI-Generated Text Detection
cs.AI 2026-05 unverdicted novelty 6.0

TELL is a new architecture for AI text detection that natively supplies explanatory annotations, reaching AUROC 0.927 and a 72.3% human win-rate on explanation quality metrics.
Better & Faster Large Language Models via Multi-token Prediction
cs.CL 2024-04 conditional novelty 6.0

Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
LLM Evaluators Recognize and Favor Their Own Generations
cs.CL 2024-04 unverdicted novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
cs.AI 2026-05 unverdicted novelty 5.0

U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
cs.AI 2026-04 conditional novelty 5.0

Gemini 2.5 Flash with a Combined Budget debiasing strategy achieves 71.0% judge agreement at ~$0.001/evaluation, outperforming frontier models at 15x lower cost.
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
cs.AI 2026-04 unverdicted novelty 5.0

Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
A Survey on LLM-as-a-Judge
cs.CL 2024-11 unverdicted novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
cs.SE 2024-10 unverdicted novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.