
arxiv: 2602.17170 · v3 · submitted 2026-02-19 · 💻 cs.IR

When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

Pith reviewed 2026-05-15 21:07 UTC · model grok-4.3

classification 💻 cs.IR
keywords LLM judges · relevance assessment · overrating behavior · information retrieval · evaluation bias · pointwise pairwise judgments

The pith

LLM-based relevance judges assign inflated scores to passages that do not satisfy the information need.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used to judge how relevant a passage is to a query in information retrieval systems. This paper examines whether these models overrate passages that actually fail to meet the user's information need. Through experiments with various models, judgment types, and modified passages, it finds that LLMs consistently give higher scores than warranted, often with high confidence. The overrating links to superficial factors like longer text or certain word choices rather than genuine relevance. This suggests LLMs introduce systematic bias when used as human substitutes for evaluation.

Core claim

Models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need, often with high confidence. This overrating behavior holds across different model backbones, pointwise and pairwise evaluation paradigms, and various passage modification strategies, indicating a system-wide bias. The judgments prove sensitive to passage length and surface-level lexical cues.
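One way to make the notion of overrating concrete is to compare LLM labels against human labels for the same query-passage pairs. The following is a minimal sketch, not the paper's own metric: the function names `overrating_rate` and `mean_signed_gap`, the integer label scale, and the toy labels are all assumptions made for illustration.

```python
from typing import Sequence


def overrating_rate(human: Sequence[int], llm: Sequence[int]) -> float:
    """Fraction of judged pairs where the LLM label exceeds the human label."""
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("no judgments given")
    over = sum(1 for h, m in zip(human, llm) if m > h)
    return over / len(human)


def mean_signed_gap(human: Sequence[int], llm: Sequence[int]) -> float:
    """Average of (LLM label - human label); positive values indicate inflation."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(m - h for h, m in zip(human, llm)) / len(human)


# Made-up labels on a hypothetical graded scale, for illustration only.
human_labels = [0, 0, 1, 2, 0, 3]
llm_labels = [1, 2, 1, 3, 0, 3]

rate = overrating_rate(human_labels, llm_labels)
gap = mean_signed_gap(human_labels, llm_labels)
print(f"overrating rate: {rate:.2f}, mean signed gap: {gap:.2f}")
```

A rate well above the matching underrating rate, together with a positive mean gap, is the kind of pattern the core claim describes as systematic rather than random.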

What carries the argument

Overrating behavior in LLM relevance judgments, demonstrated via controlled passage modifications that preserve core content but alter length and lexical features.

Load-bearing premise

The observed score inflation arises from inherent LLM limitations rather than from the chosen prompts, datasets, or other experimental factors.

What would settle it

A controlled test showing that LLMs assign appropriately low scores to lengthened or lexically altered passages that still do not meet the information need, matching human assessments.

Figures

Figures reproduced from arXiv: 2602.17170 by Chuting Yu, Guido Zuccon, Hang Li, Joel Mackenzie, Teerapong Leelanupab.

Figure 1: Label transition matrices for Qwen under different passage rewrites. Each heatmap shows normalized transitions from the LLM’s original assigned relevance labels.

Figure 2: SEM, LEX, and QRY variants for the query “do goldfish grow”. SEM preserves relevance without surface query terms; LEX provides the query terms such that they are not relevant to the information need; and QRY simply adds the query itself.

Figure 3: Predicted relevance-label distributions on non…
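The label transition matrices described in the Figure 1 caption can be illustrated with a short sketch. It assumes row normalization (each row sums to 1 over the labels assigned after a rewrite) and a four-level label scale; neither detail is confirmed by the caption, and the function name `transition_matrix` and the toy labels are hypothetical.

```python
from typing import List, Sequence


def transition_matrix(before: Sequence[int], after: Sequence[int],
                      num_labels: int = 4) -> List[List[float]]:
    """Row-normalized matrix: entry [b][a] is the share of passages that
    had label b before the rewrite and received label a after it."""
    if len(before) != len(after):
        raise ValueError("label lists must have the same length")
    counts = [[0] * num_labels for _ in range(num_labels)]
    for b, a in zip(before, after):
        counts[b][a] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no original labels stay all-zero instead of dividing by 0.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up labels for six passages, before and after a rewrite.
before_labels = [0, 0, 0, 1, 1, 2]
after_labels = [0, 1, 2, 1, 2, 2]

matrix = transition_matrix(before_labels, after_labels)
for label, row in enumerate(matrix):
    print(label, [round(v, 2) for v in row])
```

Mass above the diagonal means a rewrite pushed labels upward; mass on the diagonal means the judgment was stable.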
Original abstract

Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs exhibit systematic overrating in relevance assessment for IR, consistently assigning inflated scores (often with high confidence) to passages that do not satisfy the information need. This is demonstrated through experiments across multiple model backbones, pointwise and pairwise paradigms, and controlled passage modifications, with additional findings that judgments are sensitive to passage length and surface lexical cues. The authors conclude this indicates a system-wide bias rather than random error, cautioning against using LLMs as drop-in replacements for human judges, and release code and results publicly.

Significance. If the central observations hold after addressing setup variations, the work is significant for IR evaluation practices by providing concrete evidence of bias in LLM judges that could undermine automated assessments at scale. The public code supports reproducibility, which is a strength for empirical claims in this area.

major comments (2)
  1. [Methods] Methods: The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.
  2. [Results] Results: The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.
minor comments (2)
  1. [Abstract] Abstract: The term 'overrating behavior' would benefit from an explicit early definition to distinguish it from general score variance.
  2. [Figures] Figures: Ensure all passage modification examples are clearly labeled with their impact on lexical cues and length to aid interpretation of sensitivity results.
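The statistical detail requested in major comment 2 could take the following form. This is a sketch under stated assumptions, not the authors' analysis: it uses a paired standardized mean difference and an exact two-sided sign test on LLM-minus-human label differences, and the function names and toy labels are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_effect_size(human: Sequence[int], llm: Sequence[int]) -> float:
    """Standardized mean of paired differences (LLM label - human label)."""
    diffs = [m - h for h, m in zip(human, llm)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; effect size undefined")
    return mean(diffs) / sd


def sign_test_p(human: Sequence[int], llm: Sequence[int]) -> float:
    """Exact two-sided sign test; tied pairs are dropped."""
    diffs = [m - h for h, m in zip(human, llm)]
    n_pos = sum(1 for d in diffs if d > 0)
    n_neg = sum(1 for d in diffs if d < 0)
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = max(n_pos, n_neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up labels for eight query-passage pairs.
human_labels = [0, 0, 1, 2, 0, 3, 1, 0]
llm_labels = [1, 2, 1, 3, 1, 3, 2, 1]

d = paired_effect_size(human_labels, llm_labels)
p = sign_test_p(human_labels, llm_labels)
print(f"effect size: {d:.2f}, sign-test p-value: {p:.4f}")
```

Reporting the number of pairs alongside both quantities would let a reader judge whether the inflation is large and whether it could plausibly be noise.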

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each major comment below and outline the revisions we plan to make to strengthen the manuscript.

  1. Referee: [Methods] Methods: The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.

    Authors: We recognize the importance of prompt variations for validating the robustness of our findings. While our primary experiments used standard zero-shot templates to reflect common usage in the field, we agree that demonstrating the persistence of overrating under alternative prompting strategies would bolster the claim of a system-wide bias. In the revised manuscript, we will add a new subsection with ablations on prompt phrasing and a chain-of-thought approach to test whether the inflation effect remains consistent. We will narrow the scope of the claim if necessary, but we expect the ablations to support the original conclusions.

    revision: yes

  2. Referee: [Results] Results: The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.

    Authors: We appreciate this observation and agree that additional statistical details are necessary for rigorous interpretation. In the revised manuscript, we will expand the experimental setup and results sections to explicitly state the number of queries and passages evaluated, the statistical tests applied (including p-values for differences), any exclusion criteria, and quantitative effect sizes. These additions will clarify that the overrating is a consistent, non-random phenomenon across the tested conditions.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observational study with direct experimental results

full rationale

The paper conducts controlled experiments on LLM relevance judgments across backbones, pointwise/pairwise paradigms, and passage modifications. It reports observed score inflation and sensitivity to length/lexical cues without any derivations, equations, fitted parameters, or first-principles predictions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear; results rest on direct measurements rather than reduction to inputs by construction. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical model, free parameters, or new entities; relies on standard IR relevance assessment definitions and LLM prompting practices.

pith-pipeline@v0.9.0 · 5496 in / 983 out tokens · 97561 ms · 2026-05-15T21:07:13.464187+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

  [1] Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs Can Be Fooled into Labelling a Document as Relevant: Best Café near Me; This Paper Is Perfectly Relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’24). Tokyo, Japan, ...
  [2] Negar Arabzadeh and Charles L. A. Clarke. 2025. Benchmarking LLM-based Relevance Judgment Methods. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Padua, Italy, 3194–3204. https://doi.org/10.1145/3726302.3730305
  [3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018)
  [4] Krisztian Balog, Donald Metzler, and Zhen Qin. 2025. Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Padua, Italy, 3865–3875. https://doi.org/10.1145/3726302.3730348
  [5] Charles LA Clarke and Laura Dietz. 2024. LLM-based relevance assessment still can’t replace human relevance assessment. arXiv preprint arXiv:2412.17156 (2024). https://doi.org/10.48550/arXiv.2412.17156
  [6] Cyril Cleverdon. 1967. The Cranfield Tests on Index Language Devices. Aslib Proceedings 19, 6 (06 1967), 173–194. https://doi.org/10.1108/eb050097
  [7] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020. Overview of the TREC 2019 Deep Learning Track. In Proceedings of the 28th Text REtrieval Conference, TREC
  [8] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2021. Overview of the TREC 2020 Deep Learning Track. In Proceedings of the 29th Text REtrieval Conference, TREC
  [9] Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers Are Biased Towards LLM-Generated Content. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). Barcelona, Spain, 526–537. https://doi.org/10.1145/3637528.3671882
  [10] Tadele T. Damessie, Thao P. Nghiem, Falk Scholer, and J. Shane Culpepper. 2017. Gauging the Quality of Relevance Assessments using Inter-Rater Agreement. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). Tokyo, Japan, 1089–1092. https://doi.org/10.1145/3077136.3080729
  [11] Guglielmo Faggioli, Laura Dietz, Charles Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (ICT...
  [12] Henry Watson Fowler. 2015. Fowler’s dictionary of modern English usage. Oxford University Press
  [13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). https://arxiv.org/abs/2407.21783
  [14] Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, and Deyi Xiong. 2023. Evaluating Large Language Models: A Comprehensive Survey. arXiv preprint arXiv:2310.19736 (2023). https://arxiv.org/abs/2310.19736
  [15] David Hawking, Ellen Voorhees, Nick Craswell, and Peter Bailey. 2000. Overview of the TREC-8 Web Track. Text Retrieval Conference (TREC) (2000). https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=151494
  [16] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv preprint ...
  [17] M Yusri Ali Lubis, Reysha Miranti, and Yani Lubis. 2024. Passive voice and active voice in sentence structure. Journal of Psychology, Counseling and Education 2, 1 (2024), 59–64
  [18] Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. 2025. Query Performance Prediction Using Relevance Judgments Generated by Large Language Models. ACM Transactions on Information Systems (2025), 1–35. https://doi.org/10.1145/3736402
  [19] Samaneh Mohtadi, Kevin Roitero, Stefano Mizzaro, and Gianluca Demartini. 2026. The Effect of Document Summarization on LLM-Based Relevance Judgments. In Proceedings of the 48th European Conference on Information Retrieval (ECIR ’26). Delft, The Netherlands, 70–87. https://doi.org/10.1007/978-3-032-21300-6_5
  [20] Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz. 2025. Judging the Judges: A Collection of LLM-Generated Relevance Judgements. arXiv preprint arXiv:2502.13908 (2025). https://arxiv.org/abs/2502.13908
  [21] Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076 (2023). https://arxiv.org/abs/2310.10076
  [22] Ian Soboroff. 2025. Don’t use LLMs to make relevance judgments. Information retrieval research journal (2025), 10–54195. https://doi.org/10.54195/irrj.19625
  [23] Manveer Singh Tamber and Jimmy Lin. 2025. Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges. arXiv preprint arXiv:2501.18536 (2025). https://doi.org/10.48550/arXiv.2501.18536
  [24] Gemma Team. 2025. Gemma 3. https://goo.gle/Gemma3Report
  [25] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Washington DC, USA, 1930–1940. https://doi.org/10.1145/3626772.3657707
  [26] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv preprint arXiv:2411.08275 (2024). https://arxiv.org/abs/2411.08275
  [27] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. 2025. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR ’25). Padua, Italy, 35...
  [28] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519 (2024). https://doi.org/10.48550/arXiv.2406.06519
  [29] Ellen M Voorhees. 2006. Overview of the TREC 2005 Robust Retrieval Track. Text Retrieval Conference (TREC) (2006). https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=150643
  [30] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, ..., and Zihan Qiu. 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/arXiv.2505.09388
  [31] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). New Orlean...