Quantifying and Mitigating Self-Preference Bias of LLM Judges

Chuxian Qiu; Jinming Yang; Tao Zhou; Xinshan Jiao; Zheng Hu; Zhenyu Deng

arXiv: 2604.22891 · v3 · submitted 2026-04-24 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-15 06:54 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: self-preference bias, LLM-as-a-Judge, automated evaluation, bias mitigation, cognitive load, model alignment, LLM judges

The pith

LLM judges show self-preference bias uncorrelated with capability, but a multi-dimensional strategy reduces it by 31.5 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an automated framework to measure self-preference bias in LLMs acting as judges. It creates pairs of responses with nearly identical quality to separate genuine judgment skill from the tendency to favor the model's own outputs. Testing across 20 models finds that stronger capabilities do not reliably reduce this bias and can even increase it. The authors then present a mitigation method that decomposes evaluations into multiple dimensions based on cognitive load, lowering the measured bias by an average of 31.5 percent.

Core claim

Self-preference bias is a directional deviation where LLMs favor their own generated responses during evaluation. The paper's fully automated framework constructs equal-quality response pairs to quantify bias propensity separately from discriminability without human gold standards. Empirical results across 20 mainstream LLMs show that advanced capabilities are often uncorrelated or negatively correlated with low bias. A structured multi-dimensional evaluation strategy grounded in cognitive load decomposition mitigates this bias by 31.5 percent on average.

What carries the argument

Equal-quality response pairs that statistically disentangle bias propensity from genuine discriminability, combined with the cognitive load decomposition that structures evaluations into multiple independent dimensions.
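
The disentanglement can be pictured as two separate tallies. The sketch below is an assumption about the shape of the computation, not the paper's formula, which is not reproduced on this page: it takes β as the judge's self-pick rate on equal-quality pairs minus the 0.5 expected from an unbiased judge, and π as accuracy on pairs where one response is known to be better. The function names and the boolean data layout are hypothetical.

```python
from typing import Sequence


def bias_propensity(picked_own: Sequence[bool]) -> float:
    """Self-pick rate on equal-quality pairs, centred on 0.5.

    ASSUMED definition: positive values mean the judge favours its own
    output, negative values mean it disfavours it.
    """
    if not picked_own:
        raise ValueError("need at least one equal-quality pair")
    return sum(picked_own) / len(picked_own) - 0.5


def discriminability(picked_better: Sequence[bool]) -> float:
    """Share of unequal-quality pairs where the judge chose the better response."""
    if not picked_better:
        raise ValueError("need at least one unequal-quality pair")
    return sum(picked_better) / len(picked_better)


# Synthetic verdicts, for illustration only (not data from the paper).
equal_pairs = [True, True, False, True, False, True, False, True, True, False]
unequal_pairs = [True] * 17 + [False] * 3

beta = bias_propensity(equal_pairs)      # 6 of 10 self-picks
pi = discriminability(unequal_pairs)     # 17 of 20 correct
print(f"beta={beta:.3f}, pi={pi:.3f}")
```

Keeping the two tallies on disjoint sets of pairs is what lets a strong discriminator still register a nonzero β.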

If this is right

Bias in LLM-as-a-Judge systems can be quantified at scale without human annotations.
Mitigation via multi-dimensional evaluation can improve trustworthiness in model alignment and leaderboard construction.
Advanced model capabilities alone do not solve evaluative fairness.
Real-world quality control systems using LLM judges can apply the strategy to reduce systematic favoritism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training methods that improve capability may need separate adjustments to avoid reinforcing self-referential judgments.
The framework could extend to measuring other directional biases in automated evaluation pipelines.
Deployments in content moderation or feedback loops might gain reliability by adopting the cognitive load approach.

Load-bearing premise

The constructed pairs of responses have truly negligible quality differences so that any preference can be attributed to bias rather than actual differences.

What would settle it

Human raters detect consistent quality differences in the constructed pairs, or the proposed mitigation strategy shows no reduction in measured bias when true preferences are independently verified.

Figures

Figures reproduced from arXiv: 2604.22891 by Chuxian Qiu, Jinming Yang, Tao Zhou, Xinshan Jiao, Zheng Hu, Zhenyu Deng.

**Figure 1.** Overview of the SPB quantification and mitigation framework. The workflow…

**Figure 2.** Correlation analysis between quality and SPB. The x-axis represents the model’s…

**Figure 3.** Discrimination capability (π_i) vs. SPB (β_i). Dashed lines indicate thresholds: π_thresh = 0.8 and |β_i|_thresh = 0.08. Objective Judges. Three models are good evaluators: DeepSeek-V3-0324 (DeepSeek-AI, 2024) (π = 0.82, β = 0.024), Grok-4-Fast (π = 0.85, β = 0.035), and Kimi-Linear-48B-A3B-Instruct (Moonshot AI, 2025c) (π = 0.85, β = −0.043). Despite Kimi-Linear-48B-A3B-Instruct’s mild negative bias, all fal…

**Figure 4.** The mitigation effect of the structured multi-dimensional evaluation strategy.
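
The Figure 3 thresholds suggest a simple two-condition filter. The following is a minimal sketch using the two threshold values and the three (π, β) pairs quoted in the caption; the function name is hypothetical, and treating the thresholds as inclusive is an assumption, since the caption does not say which side of the dashed lines a boundary value falls on.

```python
PI_THRESH = 0.8     # from the Figure 3 caption
BETA_THRESH = 0.08  # from the Figure 3 caption, applied to |beta|


def is_objective_judge(pi: float, beta: float,
                       pi_thresh: float = PI_THRESH,
                       beta_thresh: float = BETA_THRESH) -> bool:
    """True when discriminability is high and bias magnitude is low.

    ASSUMPTION: both comparisons are inclusive.
    """
    return pi >= pi_thresh and abs(beta) <= beta_thresh


# (pi, beta) values as quoted in the Figure 3 caption.
judges = {
    "DeepSeek-V3-0324": (0.82, 0.024),
    "Grok-4-Fast": (0.85, 0.035),
    "Kimi-Linear-48B-A3B-Instruct": (0.85, -0.043),
}

objective = [name for name, (pi, beta) in judges.items()
             if is_objective_judge(pi, beta)]
print(objective)
```

Using the absolute value of β means a judge that disfavours its own outputs is held to the same limit as one that favours them.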

Original abstract

LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework for quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The automated equal-quality pair method is the real novelty here, but the results rest on an unverified assumption that needs external checks.

This paper gives a workable automated way to measure self-preference bias in LLM judges by building response pairs that are meant to have negligible quality differences. That setup lets them separate a model's actual discriminability from its tendency to favor its own outputs. They test it across 20 models and find that stronger capabilities often show little or no link to lower bias, sometimes even the reverse. For mitigation they break evaluation into cognitive-load dimensions and report a 31.5% average drop in the bias measure.

The approach is practical because it avoids human annotations at scale, which matters for leaderboards and alignment work. The correlation result is worth noting since it pushes back on the hope that bigger models will just judge more fairly.

The main soft spot is the pair construction itself. Everything after that step depends on those pairs truly having tiny quality gaps, yet the description supplies no human ratings, cross-checks, or other confirmation that the residual differences are small enough. If the construction method shares factors with the capabilities being measured, the bias numbers and the mitigation gain could be partly confounded. Statistical details and controls are also thin in the available text.

This is relevant for anyone running automated evaluations. It deserves peer review so the pair method and the reported reduction can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces a fully automated framework for quantifying and mitigating Self-Preference Bias (SPB) in LLM-as-a-Judge evaluations. It constructs equal-quality response pairs with negligible quality differences to statistically disentangle bias propensity from discriminability without human gold standards. Through analysis of 20 mainstream LLMs, it finds that advanced capabilities are often uncorrelated or negatively correlated with low SPB. It proposes a structured multi-dimensional evaluation strategy based on cognitive load decomposition that reduces SPB by 31.5% on average.

Significance. If the automated pair construction is valid, this provides a scalable annotation-free method to measure and mitigate bias in LLM judges, which has broad implications for model alignment and leaderboard construction. The empirical results across 20 models are a strength, as is the attempt at a parameter-free statistical separation.

major comments (2)

[§3] The automated procedure for constructing equal-quality pairs lacks any independent verification (e.g., human ratings or cross-model consistency checks) that residual quality differences are negligible. This assumption is load-bearing for the statistical disentanglement of SPB from discriminability and for the reported correlations across models.
[Abstract and results section] The 31.5% average reduction is presented without details on the exact statistical tests, confidence intervals, or controls for pair construction variability, making it impossible to assess whether the mitigation gain is robust.

minor comments (2)

[Early sections] The formal definition of SPB would benefit from an explicit equation in the early sections to clarify how bias propensity is isolated from discriminability.
[Figures and tables] Figure captions and table legends should explicitly state the number of pairs per model and any filtering criteria applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our automated framework. We address each major comment point by point below.

Referee: [§3] The automated procedure for constructing equal-quality pairs lacks any independent verification (e.g., human ratings or cross-model consistency checks) that residual quality differences are negligible. This assumption is load-bearing for the statistical disentanglement of SPB from discriminability and for the reported correlations across models.

Authors: We acknowledge the centrality of this assumption but maintain that the framework's design enables statistical disentanglement without external verification. Equal-quality pairs are generated from identical prompts under tightly controlled sampling parameters, ensuring negligible quality variance by construction; the statistical separation then isolates SPB as the directional preference observed in these pairs while discriminability is measured on deliberately unequal pairs. Human ratings would undermine the automated, annotation-free goal. In revision we will add cross-model consistency checks (e.g., verifying that quality orderings remain stable when pairs are re-evaluated by held-out models) and report the resulting agreement statistics in §3.
revision: partial

Referee: [Abstract and results section] The 31.5% average reduction is presented without details on the exact statistical tests, confidence intervals, or controls for pair construction variability, making it impossible to assess whether the mitigation gain is robust.

Authors: We agree that additional statistical rigor is required. In the revised manuscript we will specify the exact tests (paired t-tests with Bonferroni correction across the 20 models), report 95% confidence intervals around the 31.5% mean reduction, and include controls for pair-construction variability via bootstrap resampling of the pair-generation process and sensitivity analysis over prompt templates. These details will appear in the results section and be summarized in the abstract.
revision: yes
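
A 95% confidence interval of the kind promised in the rebuttal could be obtained with a percentile bootstrap. The sketch below is a simplification under stated assumptions: it resamples per-model relative reductions rather than the pair-generation process the rebuttal describes, it assumes the headline figure is a mean of per-model relative reductions in |β|, and every number in it is synthetic. The helper names are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in bias magnitude; `before` must be nonzero."""
    return (abs(before) - abs(after)) / abs(before)


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means: List[float] = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# SYNTHETIC |beta| values before and after mitigation (not from the paper).
beta_before = [0.12, 0.09, 0.15, 0.06, 0.10]
beta_after = [0.08, 0.07, 0.09, 0.05, 0.06]

reductions = [relative_reduction(b, a) for b, a in zip(beta_before, beta_after)]
ci_low, ci_high = bootstrap_ci(reductions)
print(f"mean reduction {mean(reductions):.1%}, 95% CI [{ci_low:.1%}, {ci_high:.1%}]")
```

Resampling models captures between-model variability only; the pair-construction variability the referee asks about would need a second layer that regenerates or resamples the pairs themselves.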

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical framework for constructing equal-quality response pairs and a cognitive-load-based mitigation strategy, with the 31.5% SPB reduction reported as an observed experimental outcome across 20 LLMs rather than a quantity derived by construction from fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that reduce the central claims to inputs by tautology; the pair-construction procedure is a methodological choice whose validity is external to the reported statistics, and no self-citation load-bearing steps appear in the provided text. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that equal-quality pairs can be built reliably and on the definition of SPB as a measurable directional deviation.

axioms (1)

domain assumption: Equal-quality response pairs can be constructed with negligible quality differences
This assumption is required to statistically disentangle bias propensity from discriminability without human labels.

invented entities (1)

Self-Preference Bias (SPB) · no independent evidence
purpose: To name and quantify the directional favoritism of an LLM judge toward its own outputs
SPB is introduced as the central phenomenon the framework measures and mitigates.

pith-pipeline@v0.9.0 · 5493 in / 1404 out tokens · 59167 ms · 2026-05-15T06:54:19.298185+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
doi: 10.1162/coli_a_00524. Gemini Team, & Google DeepMind (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint, 2507.06261. URL: https://arxiv.org/abs/2507.06261. Gemma Team, & Google DeepMind (2025). Gemma 3: Open models based on gemini research and tech...

[2] Relevance: Whether the response directly and completely answers the question in the instruction

[3] Accuracy: Whether the information in the response is accurate, reliable, and error-free

[4] Depth: Whether the response provides sufficient details, context, and depth

[5] Logic: Whether the logic of the response is clear, coherent, and well-structured

[6] Clarity: Whether the expression of the response is clear, understandable, and well-structured Table A.9: Quality Evaluation Prompt – Evaluation Dimension Definitions. Evaluation Criteria: Relevance & Accuracy Detailed Scoring Criteria (for each dimension): Relevance - 9.0-10.0 points: Fully understands the instruction, response directly and completely add...

[7] Carefully read the instruction and response

[8] Strictly follow the scoring criteria, independently score each dimension (0.0-10.0)

[9] IMPORTANT: Use the full scoring range

[10] Calculate the average of the five dimensions as the overall score

[11] Notes and Considerations Notes: - Each dimension score must use a 0.25 resolution (e.g., 8.0, 8.25, 8.5, 8.75)
Provide detailed evaluation reasoning Instruction: {instruction} AI Response: {response} Please output the evaluation results in the following format: Dimension Scores: - Relevance: X.X points - Accuracy: X.X points - Depth: X.X points - Logic: X.X points - Clarity: X.X points Overall Score: X.X points Detailed Reasoning: [Specific analysis of each dimens...

[12] Relevance: Which response is more relevant to the instruction? [A/B]

[13] Accuracy: Which response is more accurate and factually correct? [A/B]

[14] Depth: Which response provides more depth and comprehensive coverage? [A/B]

[15] Logic: Which response has better logical structure and reasoning? [A/B]

[16] Clarity: Which response is clearer and more well-organized? [A/B] After evaluating all five dimensions, make your final decision based on overall quality. You MUST output a single letter only: ’A’ or ’B’. Do NOT output any explanation or any other text. Answer (A or B only): Table A.17: structured multi-dimensional evaluation Prompt Template. Appendix...
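
Several of the extracted fragments above come from the paper's scoring prompt rather than its bibliography: five dimensions are each scored from 0.0 to 10.0 in 0.25 steps, and the overall score is their average. A minimal sketch of that arithmetic follows; the function name, the lowercase dimension keys, and the decision to reject off-grid scores rather than round them are all assumptions made for illustration.

```python
from typing import Mapping

# Dimension names as listed in the extracted prompt fragments.
DIMENSIONS = ("relevance", "accuracy", "depth", "logic", "clarity")


def overall_score(scores: Mapping[str, float]) -> float:
    """Average the five dimension scores after validating range and resolution."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in DIMENSIONS:
        value = float(scores[dim])
        if not 0.0 <= value <= 10.0:
            raise ValueError(f"{dim}={value} is outside 0.0-10.0")
        # Multiples of 0.25 are exact in binary floating point,
        # so this check does not suffer from rounding error.
        if not (value * 4).is_integer():
            raise ValueError(f"{dim}={value} is not on the 0.25 grid")
    return sum(float(scores[d]) for d in DIMENSIONS) / len(DIMENSIONS)


# Synthetic example scores (not taken from the paper).
example = {"relevance": 8.0, "accuracy": 8.25, "depth": 8.5,
           "logic": 8.75, "clarity": 9.0}
print(overall_score(example))
```

Note that the average of five on-grid scores need not itself fall on the 0.25 grid, so the overall score is left unrounded here.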
