When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

Chuting Yu; Guido Zuccon; Hang Li; Joel Mackenzie; Teerapong Leelanupab

arxiv: 2602.17170 · v3 · submitted 2026-02-19 · 💻 cs.IR

Pith reviewed 2026-05-15 21:07 UTC · model grok-4.3

classification 💻 cs.IR

keywords LLM judges · relevance assessment · overrating behavior · information retrieval · evaluation bias · pointwise pairwise judgments

The pith

LLM-based relevance judges assign inflated scores to passages that do not satisfy the information need.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used to judge how relevant a passage is to a query in information retrieval systems. This paper examines whether these models overrate passages that actually fail to meet the user's information need. Through experiments with various models, judgment types, and modified passages, it finds that LLMs consistently give higher scores than warranted, often with high confidence. The overrating links to superficial factors like longer text or certain word choices rather than genuine relevance. This suggests LLMs introduce systematic bias when used as human substitutes for evaluation.

Core claim

Models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need, often with high confidence. This overrating behavior holds across different model backbones, pointwise and pairwise evaluation paradigms, and various passage modification strategies, indicating a system-wide bias. The judgments prove sensitive to passage length and surface-level lexical cues.
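One way to make the overrating claim concrete is to compare each LLM label with the human label for the same query-passage pair and count how often the LLM scores higher. The sketch below is illustrative only: the function name `overrating_summary`, the 0-3 graded scale, and the toy labels are assumptions, not the paper's code or data.

```python
def overrating_summary(human_labels, llm_labels):
    """Summarise how often an LLM judge scores above the human label.

    Both inputs are equal-length lists of integer relevance grades for the
    same query-passage pairs (a 0-3 scale is assumed for illustration).
    """
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one judged pair")

    # Positive difference = the LLM rated the passage higher than the human.
    diffs = [llm - human for human, llm in zip(human_labels, llm_labels)]
    n = len(diffs)
    return {
        "overrated": sum(d > 0 for d in diffs) / n,
        "underrated": sum(d < 0 for d in diffs) / n,
        "agree": sum(d == 0 for d in diffs) / n,
        "mean_shift": sum(diffs) / n,
    }


# Toy labels, invented for this example (not results from the paper).
human = [0, 0, 1, 2, 3, 0, 1, 0]
llm = [2, 1, 1, 3, 3, 0, 2, 1]
summary = overrating_summary(human, llm)
print(summary)
```

A systematic bias, as opposed to random disagreement, would show up as an `overrated` share well above the `underrated` share together with a positive `mean_shift`.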

What carries the argument

Overrating behavior in LLM relevance judgments, demonstrated via controlled passage modifications that preserve core content but alter length and lexical features.

Load-bearing premise

The observed score inflation arises from inherent LLM limitations rather than from the chosen prompts, datasets, or other experimental factors.

What would settle it

A controlled test showing that LLMs assign appropriately low scores to lengthened or lexically altered passages that still do not meet the information need, matching human assessments.

Figures

Figures reproduced from arXiv: 2602.17170 by Chuting Yu, Guido Zuccon, Hang Li, Joel Mackenzie, Teerapong Leelanupab.

**Figure 2.** SEM, LEX, and QRY variants for the query “do goldfish grow”. SEM preserves relevance without surface query terms; LEX provides the query terms such that they are not relevant to the information need; and QRY simply adds the query itself.
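Of the three variants, QRY is the only one that can be built mechanically, so it makes a convenient illustration of a surface-level lexical cue. The sketch below is a guess at the general idea, not the authors' implementation: the caption does not say where the query is inserted, so the `position` parameter, both helper names, and the example passage are assumptions.

```python
import re


def make_qry_variant(passage, query, position="end"):
    """Build a QRY-style variant by adding the query text to a passage.

    Where the query goes is an assumption; the source only says that QRY
    "adds the query itself".
    """
    if position == "end":
        return f"{passage.rstrip()} {query.strip()}"
    if position == "start":
        return f"{query.strip()} {passage.lstrip()}"
    raise ValueError("position must be 'start' or 'end'")


def query_term_overlap(passage, query):
    """Fraction of distinct query terms that appear as whole words in the passage."""
    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(passage)) / len(query_terms)


# Invented example passage that does not answer the query.
passage = "Koi are ornamental pond fish kept in outdoor water gardens."
query = "do goldfish grow"
variant = make_qry_variant(passage, query)

print(query_term_overlap(passage, query))  # no query terms present
print(query_term_overlap(variant, query))  # every query term present
```

The variant answers the query no better than the original passage, yet its lexical overlap with the query jumps from none to complete. A judge that raises its score for the variant is reacting to the surface cue rather than to relevance.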

**Figure 1.** Label transition matrices for Qwen under different passage rewrites. Each heatmap shows normalized transitions from the LLM’s original assigned relevance labels.
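A label transition matrix of the kind the Figure 1 caption describes can be sketched as follows. Row normalization (each row sums to 1 over the labels assigned after the rewrite) is an assumption based on the phrase "from the LLM's original assigned relevance labels"; the four-level scale and the toy labels are also assumptions.

```python
def transition_matrix(before, after, num_labels=4):
    """Row-normalised transitions from original labels to post-rewrite labels.

    Entry [i][j] is the share of passages originally labelled i that were
    labelled j after the rewrite. Rows with no passages stay all zeros.
    """
    if len(before) != len(after):
        raise ValueError("label lists must be the same length")

    counts = [[0] * num_labels for _ in range(num_labels)]
    for original, rewritten in zip(before, after):
        counts[original][rewritten] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy labels, invented for this example.
before = [0, 0, 0, 1, 1, 2, 3, 3]
after = [0, 1, 2, 1, 2, 2, 3, 3]
matrix = transition_matrix(before, after)
for row in matrix:
    print([round(value, 2) for value in row])
```

Mass above the diagonal means labels moved upward after the rewrite, which is the pattern an overrating effect would produce.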

**Figure 3.** Predicted relevance-label distributions on non…

Original abstract

Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs overrate passages in relevance judgments and react to length and lexical cues, but the results look tied to fixed prompts with no ablation shown.

The paper's core finding is that LLMs assign inflated relevance scores to passages that do not meet the information need, and that these scores stay high even when the passages are altered in controlled ways. The experiments also show clear sensitivity to passage length and surface word choices, and this pattern appears across pointwise and pairwise setups and several model backbones. The controlled-modification approach is the main new piece: it moves past general reliability complaints and tries to pin down specific triggers. Releasing code and results is a plus for anyone who wants to rerun or tweak the setup. The work is straightforward and directly relevant to current IR evaluation practices that lean on LLMs for scale.

The main limitation is the lack of prompt variation. Prompts stay fixed at standard zero-shot templates, with no tests of rephrasing, chain-of-thought, or few-shot examples. If modest prompt changes reduce the overrating, the bias looks more like an artifact of the tested instructions than an inherent model property. The abstract also gives no sample sizes or statistical tests and does not say how model variability was handled, which makes it hard to judge how stable the differences really are.

The paper deserves a serious referee, because it matters to groups already running LLM-based evaluations. The experiments are replicable enough that reviewers can check the claims directly, and the topic matters for anyone scaling up IR test collections. I would send it to review rather than desk reject, with the expectation that revisions address the prompt question and add clearer statistics.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs exhibit systematic overrating in relevance assessment for IR, consistently assigning inflated scores (often with high confidence) to passages that do not satisfy the information need. This is demonstrated through experiments across multiple model backbones, pointwise and pairwise paradigms, and controlled passage modifications, with additional findings that judgments are sensitive to passage length and surface lexical cues. The authors conclude this indicates a system-wide bias rather than random error, cautioning against using LLMs as drop-in replacements for human judges, and release code and results publicly.

Significance. If the central observations hold after addressing setup variations, the work is significant for IR evaluation practices by providing concrete evidence of bias in LLM judges that could undermine automated assessments at scale. The public code supports reproducibility, which is a strength for empirical claims in this area.

major comments (2)

[Methods] The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.
[Results] The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.
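As a rough illustration of the kind of reporting the second comment asks for, the sketch below runs an exact two-sided sign test on paired label differences (LLM minus human) and reports the mean difference as a simple effect size. The choice of a sign test is an assumption made for this example; neither the paper nor the report names a specific test, and the data are invented.

```python
from math import comb


def sign_test(diffs):
    """Exact two-sided sign test on paired differences; ties are dropped.

    Returns (positives, negatives, p_value) under the null hypothesis that
    positive and negative differences are equally likely.
    """
    positives = sum(d > 0 for d in diffs)
    negatives = sum(d < 0 for d in diffs)
    n = positives + negatives
    if n == 0:
        return positives, negatives, 1.0

    k = min(positives, negatives)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, negatives, min(1.0, 2 * tail)


# Invented paired differences: LLM grade minus human grade per passage.
diffs = [2, 1, 0, 1, 0, 0, 1, 1]
positives, negatives, p_value = sign_test(diffs)
mean_shift = sum(diffs) / len(diffs)

print(f"n={len(diffs)} up={positives} down={negatives} "
      f"p={p_value:.4f} mean_shift={mean_shift:.2f}")
```

With only five non-tied pairs, even a perfectly one-sided split gives p = 0.0625, which illustrates why the sample sizes the referee asks for matter when judging whether inflation is systematic.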

minor comments (2)

[Abstract] The term 'overrating behavior' would benefit from an explicit early definition to distinguish it from general score variance.
[Figures] Ensure all passage modification examples are clearly labeled with their impact on lexical cues and length to aid interpretation of sensitivity results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each major comment below and outline the revisions we plan to make to strengthen the manuscript.

Referee: [Methods] The study holds prompt templates fixed (standard zero-shot relevance templates) across all backbones and paradigms without reporting ablations on prompt phrasing, chain-of-thought, or few-shot human examples. This is load-bearing for the system-wide bias claim, as the observed inflation could be an artifact of the specific controls rather than inherent to LLMs; if scores change under modest prompt variations, the conclusion would need narrowing.

Authors: We recognize the importance of prompt variations for validating the robustness of our findings. While our primary experiments used standard zero-shot templates to reflect common usage in the field, we agree that demonstrating the persistence of overrating under alternative prompting strategies would bolster the claim of a system-wide bias. In the revised manuscript, we will add a new subsection with ablations on prompt phrasing and a chain-of-thought approach, reporting whether the inflation effect persists. If it does not, we will narrow the scope of the claim, though we expect the results to support the original conclusions.

Revision: yes

Referee: [Results] The abstract and findings report consistent inflation but provide no details on statistical controls, sample sizes, exclusion criteria, or effect sizes. This makes it difficult to verify that the overrating reflects a non-random, system-wide property rather than fluctuations tied to the tested datasets and modifications.

Authors: We appreciate this observation and agree that additional statistical details are necessary for rigorous interpretation. In the revised manuscript, we will expand the experimental setup and results sections to explicitly state the number of queries and passages evaluated, the statistical tests applied (including p-values for differences), any exclusion criteria, and quantitative effect sizes. These additions will clarify whether the overrating is a consistent, non-random phenomenon across the tested conditions.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observational study with direct experimental results

The paper conducts controlled experiments on LLM relevance judgments across backbones, pointwise/pairwise paradigms, and passage modifications. It reports observed score inflation and sensitivity to length/lexical cues without any derivations, equations, fitted parameters, or first-principles predictions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear; results rest on direct measurements rather than reduction to inputs by construction. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical model, free parameters, or new entities; relies on standard IR relevance assessment definitions and LLM prompting practices.

pith-pipeline@v0.9.0 · 5496 in / 983 out tokens · 97561 ms · 2026-05-15T21:07:13.464187+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

[1] Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs Can Be Fooled into Labelling a Document as Relevant: Best Café near Me; This Paper Is Perfectly Relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’24). Tokyo, Japan, ...

[2] Negar Arabzadeh and Charles L. A. Clarke. 2025. Benchmarking LLM-based Relevance Judgment Methods. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Padua, Italy, 3194–3204. https://doi.org/10.1145/3726302.3730305

[3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018)

[4] Krisztian Balog, Donald Metzler, and Zhen Qin. 2025. Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Padua, Italy, 3865–3875. https://doi.org/10.1145/3726302.3730348

[5] Charles LA Clarke and Laura Dietz. 2024. LLM-based relevance assessment still can’t replace human relevance assessment. arXiv preprint arXiv:2412.17156 (2024). https://doi.org/10.48550/arXiv.2412.17156

[6] Cyril Cleverdon. 1967. The Cranfield Tests on Index Language Devices. Aslib Proceedings 19, 6 (06 1967), 173–194. https://doi.org/10.1108/eb050097

[7] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020. Overview of the TREC 2019 Deep Learning Track. In Proceedings of the 28th Text REtrieval Conference, TREC

[8] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2021. Overview of the TREC 2020 Deep Learning Track. In Proceedings of the 29th Text REtrieval Conference, TREC

[9] Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers Are Biased Towards LLM-Generated Content. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). Barcelona, Spain, 526–537. https://doi.org/10.1145/3637528.3671882

[10] Tadele T. Damessie, Thao P. Nghiem, Falk Scholer, and J. Shane Culpepper. 2017. Gauging the Quality of Relevance Assessments using Inter-Rater Agreement. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). Tokyo, Japan, 1089–1092. https://doi.org/10.1145/3077136.3080729

[11] Guglielmo Faggioli, Laura Dietz, Charles Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (ICT... doi:10.1145/3578337.3605136

[12] Henry Watson Fowler. 2015. Fowler’s dictionary of modern English usage. Oxford University Press

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). https://arxiv.org/abs/2407.21783

[14] Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, and Deyi Xiong. 2023. Evaluating Large Language Models: A Comprehensive Survey. arXiv preprint arXiv:2310.19736 (2023). https://arxiv.org/abs/2310.19736

[15] David Hawking, Ellen Voorhees, Nick Craswell, and Peter Bailey. 2000. Overview of the TREC-8 Web Track. Text Retrieval Conference (TREC) (2000). https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=151494

[16] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv preprint ... doi:10.48550/arxiv.2310.06825

[17] M Yusri Ali Lubis, Reysha Miranti, and Yani Lubis. 2024. Passive voice and active voice in sentence structure. Journal of Psychology, Counseling and Education 2, 1 (2024), 59–64

[18] Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. 2025. Query Performance Prediction Using Relevance Judgments Generated by Large Language Models. ACM Transactions on Information Systems (2025), 1–35. https://doi.org/10.1145/3736402

[19] Samaneh Mohtadi, Kevin Roitero, Stefano Mizzaro, and Gianluca Demartini. 2026. The Effect of Document Summarization on LLM-Based Relevance Judgments. In Proceedings of the 48th European Conference on Information Retrieval (ECIR ’26). Delft, The Netherlands, 70–87. https://doi.org/10.1007/978-3-032-21300-6_5

[20] Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz. 2025. Judging the Judges: A Collection of LLM-Generated Relevance Judgements. arXiv preprint arXiv:2502.13908 (2025). https://arxiv.org/abs/2502.13908

[21] Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076 (2023). https://arxiv.org/abs/2310.10076

[22] Ian Soboroff. 2025. Don’t use LLMs to make relevance judgments. Information retrieval research journal (2025), 10–54195. https://doi.org/10.54195/irrj.19625

[23] Manveer Singh Tamber and Jimmy Lin. 2025. Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges. arXiv preprint arXiv:2501.18536 (2025). https://doi.org/10.48550/arXiv.2501.18536

[24] Gemma Team. 2025. Gemma 3. https://goo.gle/Gemma3Report

[25] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Washington DC, USA, 1930–1940. https://doi.org/10.1145/3626772.3657707

[26] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv preprint arXiv:2411.08275 (2024). https://arxiv.org/abs/2411.08275

[27] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. 2025. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR ’25). Padua, Italy, 35... doi:10.1145/3731120.3744605

[28] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519 (2024). https://doi.org/10.48550/arXiv.2406.06519

[29] Ellen M Voorhees. 2006. Overview of the TREC 2005 Robust Retrieval Track. Text Retrieval Conference (TREC) (2006). https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=150643

[30] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, ..., and Zihan Qiu. 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/arXiv.2505.09388

[31] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). New Orlean... doi:10.5555/3666122.3668142
