Evaluating Scoring Bias in LLM-as-a-Judge

Chao Chen; Haixiang Hu; Kailai Shao; Qingquan Li; Shaoyu Dou

arxiv: 2506.22316 · v4 · pith:3B54YSARnew · submitted 2025-06-27 · 💻 cs.CL

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

Pith reviewed 2026-05-22 12:56 UTC · model grok-4.3

keywords: LLM-as-a-Judge · scoring bias · rubric order bias · score ID bias · reference answer score bias · automated evaluation · prompt bias · LLM reliability

The pith

Even advanced LLMs show substantial scoring biases in absolute evaluations due to prompt features like rubric order and score labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts attention from comparison-based judgments to scoring-based ones, where LLMs assign absolute scores. It defines scoring bias as effects originating in the prompt and introduces three specific types: changes in rubric ordering, use of particular score identifiers, and inclusion of reference answers. A framework with new metrics and an automatic pipeline for generating test cases measures these effects across models. Experiments indicate the biases remain large in current leading systems. Readers care because absolute scoring is common in practical LLM development and flawed scores can distort progress tracking.

Core claim

We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases.

What carries the argument

A framework of multi-faceted metrics paired with an automatic data synthesis pipeline that isolates and measures prompt-originating scoring biases in absolute LLM evaluations.

If this is right

Scoring prompts require deliberate design to limit order effects from rubric sequences.
Score label choices can systematically alter the numeric values LLMs assign.
Reference answers in prompts can anchor or shift the distribution of given scores.
Industrial absolute-scoring pipelines should incorporate bias checks to avoid distorted model assessments.
Comparative evaluation studies may miss bias patterns that appear only in standalone scoring.
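The first two points above can be probed by rendering the same rubric in several prompt variants and scoring identical items under each. The following is a minimal sketch, not the paper's pipeline: the rubric text, the `render_rubric` and `build_variants` helpers, and the choice of letters as alternative score IDs are all assumptions made for illustration.

```python
# Hypothetical four-level rubric; the wording is invented for this sketch.
RUBRIC = {
    1: "The response is wrong or irrelevant.",
    2: "The response is partly correct but has major gaps.",
    3: "The response is mostly correct with minor issues.",
    4: "The response is fully correct and well explained.",
}


def render_rubric(rubric, order, labels):
    """Render one rubric line per score, in `order`, using `labels[score]` as the ID."""
    return "\n".join(f"{labels[score]}: {rubric[score]}" for score in order)


def build_variants(rubric):
    """Return rubric texts that differ only in level order or in score IDs."""
    scores = sorted(rubric)
    numeric = {s: str(s) for s in scores}
    # Assumption: letters stand in for "alternative score IDs".
    letters = {s: chr(ord("A") + i) for i, s in enumerate(scores)}
    return {
        "baseline": render_rubric(rubric, scores, numeric),
        "reversed_order": render_rubric(rubric, scores[::-1], numeric),
        "letter_ids": render_rubric(rubric, scores, letters),
    }


variants = build_variants(RUBRIC)
for name, text in variants.items():
    print(f"--- {name} ---\n{text}\n")
```

Because each variant changes exactly one prompt feature relative to the baseline, any score difference between variants on the same item can be attributed to that feature, at least to the extent that the judge is otherwise deterministic.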

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standardized prompt templates could reduce these biases if the effects prove consistent across tasks.
The measurement approach might extend to testing similar prompt sensitivities in non-evaluation LLM uses such as summarization.
Fine-tuning or post-processing steps could be added to LLM judges specifically to counteract the identified prompt effects.
Cross-domain tests on non-English or specialized content could reveal whether the biases vary by language or subject.

Load-bearing premise

The automatically synthesized evaluation corpus and chosen metrics isolate prompt-based scoring biases without interference from other LLM behaviors or data artifacts.

What would settle it

Repeating the quantification experiments on a set of human-written evaluation examples and finding no measurable score shifts traceable to rubric order, score ID labels, or reference answers would indicate the claimed biases are not isolated or substantial.
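A check of this kind reduces to comparing scores for the same items under a baseline prompt and a perturbed prompt. The sketch below shows one way to summarize such a comparison. The `score_shift` function and its three summary statistics are illustrative assumptions, not the metrics defined in the paper, and the example scores are made up.

```python
from statistics import mean


def score_shift(baseline, perturbed):
    """Summarize how scores move between two prompt variants.

    `baseline` and `perturbed` are equal-length lists of numeric scores
    assigned to the same items, in the same order.
    """
    if len(baseline) != len(perturbed) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    return {
        "mean_shift": mean(diffs),                      # signed drift
        "mean_abs_shift": mean(abs(d) for d in diffs),  # magnitude of change
        "flip_rate": sum(d != 0 for d in diffs) / len(diffs),  # share of items whose score changed
    }


# Made-up scores for illustration only; not results from the paper.
baseline_scores = [3, 4, 2, 5, 3]
perturbed_scores = [3, 3, 2, 4, 4]
print(score_shift(baseline_scores, perturbed_scores))
```

If all three statistics stay near zero on human-written examples across the three perturbations, that would point toward the outcome described above.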

Original abstract

The "LLM-as-a-Judge" paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial applications, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines three prompt-originating scoring biases and gives a measurement framework, but the synthetic corpus needs tighter controls to rule out other LLM artifacts.

The main thing here is the shift to absolute scoring biases instead of the usual pairwise comparison work. The authors define three new types tied to the prompt (rubric order bias, score ID bias, and reference answer score bias) and back them with a framework that includes multi-faceted metrics plus an automatic pipeline for building the evaluation corpus. They also note that absolute scoring is often more practical than comparison in industrial applications, which is a fair point. The experiments are presented as showing these biases in advanced models, and the formal definitions plus the named categories are presented as new relative to prior comparative bias literature. That framing and the synthesis approach are the clearest contributions so far.

The softer spot is exactly the one the stress-test note flags: the automatic corpus generation could introduce correlations between the generated items and model preferences on topic, length, or style. Without explicit ablations or controls that separate the three targeted prompt elements from those other factors, the score shifts could reflect general prompt sensitivity rather than the claimed scoring biases alone. The abstract does not include the statistical details or ablation results that would let a reader judge how well the isolation works.

This is useful reading for anyone building or relying on LLM-based evaluators in production settings. Teams that need concrete ways to test scoring prompts will find the metrics and pipeline worth trying. It should go to peer review because the topic is timely, the claims are falsifiable, and the experimental setup can be tightened with the right revisions.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide the first dedicated study of scoring bias in LLM-as-a-Judge systems for absolute scoring tasks. It formally defines scoring bias and identifies three novel prompt-originating types (rubric order bias, score ID bias, and reference answer score bias). The authors introduce a framework consisting of multi-faceted metrics and an automatic data synthesis pipeline to generate a tailored evaluation corpus, then report experiments showing that even advanced LLMs exhibit substantial biases from these prompt elements, along with insights for mitigation.

Significance. If the results hold after addressing isolation concerns, the work would be significant for addressing an understudied aspect of LLM judge reliability in practical scoring applications. The shift from target-related biases to prompt-originating ones, combined with the automatic synthesis pipeline and multi-faceted metrics, provides a reproducible approach to quantifying such issues and yields actionable prompt-design guidance. These elements strengthen the contribution beyond prior comparative-bias studies.

major comments (2)

[Framework and Experiments description] The central claim that experiments demonstrate substantial scoring biases attributable to the three defined prompt elements requires stronger support for isolation from confounds. In the paragraph describing the framework and experiments, the automatic synthesis pipeline and multi-faceted metrics are presented without explicit ablations or controls showing orthogonality to known LLM artifacts such as verbosity, length, or lexical patterns that could correlate with generated questions or references.
[Abstract and Experiments] The empirical demonstration that 'even the most advanced LLMs suffer from these substantial scoring biases' is load-bearing for the paper's conclusions, yet the abstract and framework description provide no details on statistical reporting, effect sizes, or variance across models and runs. This makes it difficult to assess whether observed score shifts exceed baseline prompt sensitivity.

minor comments (2)

[Definition of scoring bias] The formal definition of scoring bias would benefit from an explicit mathematical formulation or pseudocode to clarify how the three bias types are quantified independently.
Figure or table captions describing the multi-faceted metrics should include more detail on how each metric is computed to improve clarity for readers implementing the framework.
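As an illustration of what the first minor comment asks for, one possible formulation is sketched below. It is an assumption made here for clarity, not the definition used in the paper. Let $J(x, p)$ denote the score a judge assigns to item $x$ under scoring prompt $p$, let $p_0$ be a baseline prompt, let $p_k$ be the same prompt with only feature $k$ perturbed, and let $D$ be the evaluation corpus.

```latex
\mathrm{Bias}_k
  \;=\; \frac{1}{|D|} \sum_{x \in D} \bigl|\, J(x, p_k) - J(x, p_0) \,\bigr|,
\qquad
k \in \{\text{rubric order},\ \text{score ID},\ \text{reference answer score}\}
```

Under this sketch, each bias type is quantified independently because $p_k$ differs from $p_0$ in exactly one feature. A value of zero means the perturbation never changed a score.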

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and positive assessment of the paper's significance. We address each major comment below and agree to strengthen the manuscript accordingly.

Referee: [Framework and Experiments description] The central claim that experiments demonstrate substantial scoring biases attributable to the three defined prompt elements requires stronger support for isolation from confounds. In the paragraph describing the framework and experiments, the automatic synthesis pipeline and multi-faceted metrics are presented without explicit ablations or controls showing orthogonality to known LLM artifacts such as verbosity, length, or lexical patterns that could correlate with generated questions or references.

Authors: We agree that explicit isolation from confounds is necessary to support the central claims. While the synthesis pipeline generates controlled synthetic data to reduce unintended correlations, we acknowledge the absence of dedicated ablations in the current version. In the revision, we will add experiments that explicitly control for response length, lexical diversity, and verbosity by generating matched variants and demonstrating that score shifts attributable to rubric order, score ID, and reference answer biases remain after these controls.
Revision: yes

Referee: [Abstract and Experiments] The empirical demonstration that 'even the most advanced LLMs suffer from these substantial scoring biases' is load-bearing for the paper's conclusions, yet the abstract and framework description provide no details on statistical reporting, effect sizes, or variance across models and runs. This makes it difficult to assess whether observed score shifts exceed baseline prompt sensitivity.

Authors: We concur that statistical rigor is essential for evaluating the magnitude and reliability of the reported biases. The experiments section currently reports mean score differences across models and runs, but we will revise both the abstract and main text to include effect sizes (e.g., standardized differences), variance measures (standard deviations across runs), and comparisons against neutral prompt baselines. This will clarify whether the observed shifts exceed typical prompt sensitivity.
Revision: yes
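The statistics named in this response can be computed with the standard library alone. The sketch below is an assumption about what such reporting might look like, not the authors' analysis code: `paired_effect_size` is the mean paired difference divided by the standard deviation of the differences, and `run_variability` is the standard deviation of per-run mean scores. The example scores are made up.

```python
from statistics import mean, stdev


def paired_effect_size(baseline, perturbed):
    """Standardized mean difference for paired scores (mean diff / SD of diffs)."""
    if len(baseline) != len(perturbed) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; effect size is undefined")
    return mean(diffs) / sd


def run_variability(scores_by_run):
    """Sample standard deviation of the mean score across repeated runs."""
    run_means = [mean(run) for run in scores_by_run]
    return stdev(run_means)


# Made-up scores for illustration only; not results from the paper.
baseline_scores = [3, 4, 2, 5, 3]
perturbed_scores = [3, 3, 2, 4, 4]
print("effect size:", paired_effect_size(baseline_scores, perturbed_scores))
print("run-to-run SD:", run_variability([[3, 4], [3, 3], [4, 4]]))
```

Comparing the effect size of a bias perturbation against the one obtained from a neutral perturbation (for example, rephrasing an instruction without touching the rubric) is one way to judge whether a shift exceeds baseline prompt sensitivity.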

Circularity Check

0 steps flagged

No circularity in empirical bias measurement study

This is an empirical measurement study that defines three new bias types (rubric order bias, score ID bias, reference answer score bias) and introduces a quantification framework plus automatic synthesis pipeline. The central claims rest on direct experimental measurements of LLM scoring behavior rather than any mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions. No load-bearing step reduces to the paper's own inputs by construction; results are falsifiable against external LLM outputs and do not rely on self-citation chains for uniqueness or ansatz smuggling. The analysis is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical results of the proposed quantification framework applied to synthetic data; no free parameters, invented entities, or non-standard axioms are described in the abstract.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
cs.CV 2026-05 conditional novelty 6.0

Multimodal LLMs exhibit central tendency bias when scoring ordinal clinical images, over-predicting low scores and under-predicting high scores even after prompt ablations.

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

A new MTMM-geometric framework unifies LLM evaluation metrics into three latent dimensions to separate method variance from true capabilities.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
cs.AI 2026-05 unverdicted novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
cs.AI 2026-04 unverdicted novelty 6.0

Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
cs.CL 2026-05 unverdicted novelty 5.0

A systematization of knowledge unifies nine LLM metrics into three orthogonal latent dimensions via an MTMM-geometric framework to improve construct validity in evaluation.

Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
cs.CL 2026-03 unverdicted novelty 5.0

Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.