When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
Pith reviewed 2026-05-15 21:07 UTC · model grok-4.3
The pith
LLM-based relevance judges assign inflated scores to passages that do not satisfy the information need.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need, often with high confidence. This overrating behavior holds across different model backbones, pointwise and pairwise evaluation paradigms, and various passage modification strategies, indicating a system-wide bias. The judgments prove sensitive to passage length and surface-level lexical cues.
What carries the argument
Overrating behavior in LLM relevance judgments, demonstrated via controlled passage modifications that preserve core content but alter length and lexical features.
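One way to make "overrating" concrete is as the share of passages that human assessors marked non-relevant but the LLM judge graded as relevant. The sketch below is illustrative only and rests on assumptions not taken from the paper: a 0-3 graded scale, a relevance cutoff of 2, and the hypothetical names `Judgment`, `overrating_rate`, and `mean_signed_error`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One query-passage pair with a human and an LLM grade (assumed 0-3 scale)."""
    human: int
    llm: int


def overrating_rate(judgments, cutoff=2):
    """Fraction of human-non-relevant passages that the LLM grades at or above cutoff.

    The cutoff of 2 is an illustrative assumption, not a value from the paper.
    """
    non_relevant = [j for j in judgments if j.human < cutoff]
    if not non_relevant:
        return 0.0
    inflated = sum(1 for j in non_relevant if j.llm >= cutoff)
    return inflated / len(non_relevant)


def mean_signed_error(judgments):
    """Average of (LLM grade - human grade); positive values indicate inflation."""
    if not judgments:
        return 0.0
    return sum(j.llm - j.human for j in judgments) / len(judgments)


# Toy data, invented for illustration only.
sample = [Judgment(0, 2), Judgment(1, 3), Judgment(0, 0), Judgment(3, 3), Judgment(1, 1)]
print(overrating_rate(sample), mean_signed_error(sample))
```

Reporting both numbers separates how often the judge crosses the relevance cutoff from how far its grades drift on average.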
Load-bearing premise
The observed score inflation arises from inherent LLM limitations rather than from the chosen prompts, datasets, or other experimental factors.
What would settle it
A controlled test showing that LLMs assign appropriately low scores to lengthened or lexically altered passages that still do not meet the information need, matching human assessments.
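A test of this kind could be organized as a small harness that modifies a non-relevant passage and records how much the judge's score moves. The sketch below assumes hypothetical helper names throughout, and `toy_lexical_judge` is a deliberately naive stand-in for an LLM call, not the paper's setup.

```python
def pad_with_filler(passage, filler, repeats):
    """Lengthen a passage with off-topic filler; the original content is unchanged."""
    return passage + " " + " ".join([filler] * repeats)


def inject_query_terms(passage, query):
    """Append the query's words so the passage gains lexical overlap but no new facts."""
    return passage + " " + query


def score_shift(judge, query, passage, modify):
    """Return judge(modified) - judge(original) for one query-passage pair."""
    return judge(query, modify(passage)) - judge(query, passage)


def toy_lexical_judge(query, passage):
    """Stand-in for an LLM judge (an assumption): counts distinct query words found."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words)


# Toy example, invented for illustration only.
query = "capital of australia"
passage = "Sydney hosts the famous opera house."
shift = score_shift(toy_lexical_judge, query, passage,
                    lambda p: inject_query_terms(p, query))
print(shift)
```

For a passage that still does not answer the query, a rigorous judge should yield a shift near zero under both modifications. The toy judge is purely lexical, so it rewards the injected query terms, which is the kind of behavior the test is meant to expose.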
Original abstract
Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit systematic overrating in relevance assessment for IR, consistently assigning inflated scores (often with high confidence) to passages that do not satisfy the information need. This is demonstrated through experiments across multiple model backbones, pointwise and pairwise paradigms, and controlled passage modifications, with additional findings that judgments are sensitive to passage length and surface lexical cues. The authors conclude this indicates a system-wide bias rather than random error, cautioning against using LLMs as drop-in replacements for human judges, and release code and results publicly.
Significance. If the central observations hold after addressing setup variations, the work is significant for IR evaluation practices by providing concrete evidence of bias in LLM judges that could undermine automated assessments at scale. The public code supports reproducibility, which is a strength for empirical claims in this area.
major comments (2)
- [Methods] Methods: The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.
- [Results] Results: The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.
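The statistical detail requested here could take the form of a paired effect size plus a paired permutation test on the differences between LLM and human grades. The sketch below is one possible analysis, not the authors' method. The function names and the toy grades are assumptions made for illustration.

```python
import random
import statistics


def paired_effect_size(llm_scores, human_scores):
    """Cohen's d for paired samples: mean difference divided by the std of differences."""
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; effect size undefined")
    return statistics.mean(diffs) / sd


def sign_flip_p_value(llm_scores, human_scores, n_resamples=10_000, seed=0):
    """One-sided paired permutation test for mean(LLM - human) > 0."""
    rng = random.Random(seed)
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    observed = statistics.mean(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.mean(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy grades, invented for illustration only.
human = [0, 1, 0, 2, 1, 0, 3, 1]
llm = [2, 2, 1, 3, 1, 2, 3, 3]
print(paired_effect_size(llm, human), sign_flip_p_value(llm, human))
```

A sign-flip test makes no normality assumption, which suits ordinal relevance grades. Reporting it per dataset and per modification strategy would show whether the inflation is uniform or concentrated in a few conditions.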
minor comments (2)
- [Abstract] Abstract: The term 'overrating behavior' would benefit from an explicit early definition to distinguish it from general score variance.
- [Figures] Figures: Ensure all passage modification examples are clearly labeled with their impact on lexical cues and length to aid interpretation of sensitivity results.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address each major comment below and outline the revisions we plan to make to strengthen the manuscript.
- Referee: [Methods] Methods: The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.
Authors: We recognize the importance of prompt variations for validating the robustness of our findings. Our primary experiments used standard zero-shot templates to reflect common usage in the field, but we agree that demonstrating the persistence of overrating under alternative prompting strategies would bolster the claim of a system-wide bias. In the revised manuscript, we will add a subsection with ablations on prompt phrasing and a chain-of-thought approach to show whether the inflation effect remains consistent. We will narrow the scope of the claim if the ablations require it, but we expect them to support the original conclusions.
Revision: yes
- Referee: [Results] Results: The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.
Authors: We appreciate this observation and agree that additional statistical details are necessary for rigorous interpretation. In the revised manuscript, we will expand the experimental setup and results sections to state explicitly the number of queries and passages evaluated, the statistical tests applied (with p-values for the reported differences), any exclusion criteria, and quantitative effect sizes. These additions will clarify that the overrating is a consistent, non-random phenomenon across the tested conditions.
Revision: yes
Circularity Check
No circularity: empirical observational study with direct experimental results
The paper conducts controlled experiments on LLM relevance judgments across backbones, pointwise/pairwise paradigms, and passage modifications. It reports observed score inflation and sensitivity to length/lexical cues without any derivations, equations, fitted parameters, or first-principles predictions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear; results rest on direct measurements rather than reduction to inputs by construction. This is a standard non-circular empirical analysis.