Self-Preference Bias in LLM-as-a-Judge
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-15 14:41 UTC · model grok-4.3
The pith
As judges, LLMs assign higher scores to low-perplexity outputs than human evaluators do, even for text they did not generate themselves.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit self-preference bias: they assign higher evaluations than human evaluators do to outputs with lower perplexity, and this pattern holds regardless of whether the outputs were self-generated. The bias therefore appears to arise because LLMs prefer texts that are more familiar to them, as measured by perplexity.
What carries the argument
Quantitative metric for self-preference bias and analysis of its correlation with output perplexity.
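The correlational half of that machinery is simple enough to sketch. Below is a minimal illustration, assuming GPT-2 via Hugging Face transformers as a stand-in scoring model and hard-coded placeholder outputs and judge scores; the paper's actual judge model, prompts, and data are not reproduced here.

```python
import math

import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Exponentiated mean token-level negative log-likelihood under the scoring model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy; labels shifted internally
    return math.exp(loss.item())

# Placeholder candidate outputs and judge ratings; real data would come from
# the dialogue-evaluation benchmark and the LLM judge's scores.
outputs = [
    "The meeting is scheduled for Monday at ten.",
    "Per our prior correspondence, the aforementioned meeting shall commence Monday, ten o'clock.",
    "meeting monday 10 be there",
    "We will meet on Monday morning at 10 a.m.",
]
judge_scores = [8.5, 6.0, 3.0, 9.0]

ppls = [perplexity(o) for o in outputs]
rho, p = spearmanr(ppls, judge_scores)
# The paper's finding corresponds to rho being more negative for the LLM judge
# than for human raters: lower perplexity attracts higher machine scores.
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
```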
If this is right
- LLM evaluators promote styles and policies intrinsic to the models.
- Automated evaluation of dialogue systems risks systematic skew toward familiar text.
- The bias is driven by perplexity preference rather than explicit self-recognition.
- New metric enables quantitative tracking of this effect across models and tasks.
Where Pith is reading between the lines
- Evaluations could be adjusted by normalizing judge scores for perplexity to better match human judgments (a minimal sketch of one such adjustment follows this list).
- The same mechanism may appear in other applications where LLMs assess text quality.
- Training LLMs on more diverse perplexity levels might lessen the bias in judging.
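On the first point above, one concrete shape the perplexity adjustment could take is residualization: regress judge scores on log-perplexity and keep what perplexity cannot explain. This is our illustration, not a procedure the paper proposes.

```python
import numpy as np

def perplexity_normalized(scores, ppls):
    """Residualize judge scores against log-perplexity.

    Fits scores ~ a + b * log(ppl) by least squares, then returns the
    residuals re-centered on the mean score, removing the component of
    the rating that is linearly explained by familiarity.
    """
    x = np.log(np.asarray(ppls, dtype=float))
    y = np.asarray(scores, dtype=float)
    b, a = np.polyfit(x, y, 1)  # slope, intercept
    return y - (a + b * x) + y.mean()

# Scores that merely track perplexity flatten out after adjustment.
print(perplexity_normalized([8.0, 7.0, 5.0, 3.0], [12.0, 20.0, 55.0, 140.0]))
```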
Load-bearing premise
The introduced metric isolates self-preference bias from confounding factors in LLM judgments.
What would settle it
If LLMs and humans assigned equally high evaluations to low-perplexity outputs, the claim that LLMs exhibit a distinct bias tied to perplexity would be falsified.
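Operationally, the question is whether the score-versus-perplexity slope differs between LLM and human raters on the same items. A paired permutation test on the slope gap is one way to frame that check; the construction below is ours, not the paper's analysis.

```python
import numpy as np

def _slope(x, y):
    return np.polyfit(x, y, 1)[0]

def slope_gap_test(log_ppl, llm_scores, human_scores, n_perm=10_000, seed=0):
    """Permutation test for a rater-by-perplexity interaction.

    H0: LLM and human ratings depend on log-perplexity with the same slope.
    Under H0 the two ratings of each item are exchangeable, so we randomly
    swap them per item and recompute the absolute slope gap.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(log_ppl, dtype=float)
    a = np.asarray(llm_scores, dtype=float)
    b = np.asarray(human_scores, dtype=float)
    observed = abs(_slope(x, a) - _slope(x, b))
    exceed = 0
    for _ in range(n_perm):
        flip = rng.random(len(x)) < 0.5
        a2, b2 = np.where(flip, b, a), np.where(flip, a, b)
        exceed += abs(_slope(x, a2) - _slope(x, b2)) >= observed
    return observed, (exceed + 1) / (n_perm + 1)

# A small p-value rejects "same rate": the judge's scores fall off with
# perplexity faster (or slower) than the humans' do.
```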
Original abstract
Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a novel quantitative metric for measuring self-preference bias in LLM-as-a-judge setups for dialogue evaluation. It reports experimental results showing that GPT-4 exhibits significant self-preference bias and hypothesizes that the bias stems from LLMs favoring lower-perplexity (more familiar) outputs. Analysis indicates LLMs assign higher scores to low-perplexity texts than human evaluators do, even for non-self-generated outputs, concluding that perplexity is the essence of the bias.
Significance. If the metric validly isolates self-preference bias and the perplexity correlation proves causal rather than confounded by quality detection differences, the work would supply a useful quantitative tool for diagnosing and addressing biases in automated evaluation, with direct implications for reliable LLM judges. The GPT-4 experiments provide a concrete empirical anchor, but stronger isolation of the familiarity mechanism would be needed to elevate the contribution beyond correlational observation.
major comments (2)
- [Experimental Analysis] The correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence (a matched-pair design is sketched after this list).
- [Metric Definition] The novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.
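For the first major comment, the requested control could start from matched pairs. The sketch below is entirely illustrative: lexical Jaccard overlap stands in for semantic equivalence, character length for output length, equal human scores for matched quality, and the thresholds are arbitrary.

```python
def matched_pairs(items, len_tol=0.1, sim_min=0.8, ppl_gap=1.5):
    """Pairs matched on length, overlap, and human-rated quality, split on perplexity.

    `items` is a list of dicts with keys 'text', 'ppl', and 'human_score'.
    A real study would replace Jaccard overlap with embeddings or human
    paraphrase judgments.
    """
    def jaccard(s, t):
        ws, wt = set(s.lower().split()), set(t.lower().split())
        return len(ws & wt) / len(ws | wt)

    pairs = []
    for i, u in enumerate(items):
        for v in items[i + 1:]:
            longer = max(len(u["text"]), len(v["text"]))
            if (abs(len(u["text"]) - len(v["text"])) <= len_tol * longer
                    and jaccard(u["text"], v["text"]) >= sim_min
                    and u["human_score"] == v["human_score"]
                    and max(u["ppl"], v["ppl"]) / min(u["ppl"], v["ppl"]) >= ppl_gap):
                pairs.append((u, v))  # near-duplicates that differ mainly in perplexity
    return pairs
```

If the LLM judge still systematically prefers the low-perplexity member of each pair, the familiarity account gains causal support.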
minor comments (2)
- [Abstract] The statistical significance levels and exact sample sizes for the GPT-4 self-preference results should be stated explicitly rather than described only qualitatively as 'significant'.
- [Methods] Notation: the paper should clarify whether perplexity is computed with the same model family used as judge or with a separate reference model, as this choice affects the interpretation of 'familiarity'.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below, providing our responses and indicating planned revisions to improve the manuscript.
Point-by-point responses
- Referee: [Experimental Analysis] The correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence.
  Authors: We appreciate this point and agree that a fully controlled experiment isolating perplexity (while holding semantic content, human-rated quality, and length fixed) would provide stronger causal evidence. Our current results show that the preference for lower-perplexity text persists for non-self-generated outputs and diverges from human judgments, which supports familiarity as a contributing factor rather than recognition-driven self-preference. However, we acknowledge the limitation in distinguishing this from differences in fluency detection. In the revision, we will add a dedicated limitations subsection, tone down the phrasing from 'essence' to 'a primary contributing factor,' and include supplementary analyses using length-matched and semantically similar output pairs to better control for confounds. [revision: partial]
- Referee: [Metric Definition] The novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.
  Authors: We thank the referee for highlighting this omission. The metric is defined independently as the average score difference between self-generated and cross-generated outputs under matched conditions, without any perplexity or familiarity terms. We will include the full mathematical definition and pseudocode in the revised Metric Definition section to enable verification and ensure the reported perplexity correlation is an independent empirical finding. [revision: yes]
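As described in this response, the metric admits a one-line formalization. The notation below (B_self for the bias, s_J for the judge's score, D_self and D_cross for matched sets of self- and cross-generated outputs) is ours, not necessarily the paper's:

\[
B_{\mathrm{self}} \;=\; \frac{1}{|\mathcal{D}_{\mathrm{self}}|}\sum_{x \in \mathcal{D}_{\mathrm{self}}} s_J(x) \;-\; \frac{1}{|\mathcal{D}_{\mathrm{cross}}|}\sum_{y \in \mathcal{D}_{\mathrm{cross}}} s_J(y)
\]

A positive B_self indicates the judge favors its own generations; nothing in this definition references perplexity, so the reported correlation would remain an independent measurement.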
Circularity Check
No significant circularity; metric and perplexity correlation are independently measured
Full rationale
The paper introduces a novel quantitative metric for self-preference bias as a distinct contribution, then separately hypothesizes that lower perplexity drives the bias and reports an empirical correlation between LLM evaluations and output perplexity (observed even for non-self-generated text). No equation or definition in the provided text shows the bias metric being constructed from perplexity terms, nor is any 'prediction' obtained by fitting a parameter to the same data used for the target claim. The central finding is an observed divergence between LLM and human scoring that correlates with perplexity; this is presented as an empirical result rather than a definitional identity. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to force the conclusion. The chain of evidence is therefore anchored in external measurements rather than internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Perplexity computed by the LLM is a valid measure of how familiar or preferred a text is to that model.
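For reference, the perplexity this assumption leans on is the standard autoregressive quantity for a model with parameters θ and a token sequence x_1, ..., x_N:

\[
\mathrm{PPL}_\theta(x) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
\]

Lower values mean the model assigns the text higher probability, i.e. finds it more familiar; that familiarity also implies preference is precisely what the axiom asserts rather than derives.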
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Linked passage: "Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity."
- IndisputableMonolith.Foundation.LawOfExistence.law_of_existence (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Linked passage: "We propose a new metric to quantify self-preference bias in LLMs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
  LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
- MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
  MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under pertu...
- Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs
  DESG uses dynamic graphs of decoupled clinical states and asymmetric geometry to evaluate therapeutic dialogue quality, reaching 0.9353 macro-F1 on a 600-window held-out test set and outperforming LLM judges and text ...
- Green Shielding: A User-Centric Approach Towards Trustworthy AI
  Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
- Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
  Personalized LLM judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges trained on mixed histories.
- LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
  Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.
- How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
  A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
  Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
  CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.
- Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
  Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
- When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
  Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
  Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
  StratMem-Bench reveals that state-of-the-art LLMs distinguish required from irrelevant memories effectively but struggle to integrate supportive memories in character conversations.
- Learning to Control Summaries with Score Ranking
  A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
  Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
- LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
  LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
- Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
  Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...
- Towards Self-Improving Error Diagnosis in Multi-Agent Systems
  ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with veri...
- Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
  Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.