LLMs as Assessors: Right for the Right Reason?
Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3
The pith
LLMs match human relevance judgments on documents but cite different passages as evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs prompted to judge document relevance and highlight useful passages on the INEX Wikipedia collection agree with human assessors at the document level but disagree substantially on the specific passages selected as evidence. This shows LLMs are often right about relevance without being right for the same reasons as humans. Consequently, while LLMs can be used judiciously to reduce the amount of human labor required for benchmark datasets, they cannot replace human assessors.
What carries the argument
The passage-highlighting task performed by both LLMs and humans on the same INEX Wikipedia queries and documents, allowing comparison of relevance labels together with the cited evidence for those labels.
Load-bearing premise
Human passage highlights provide a reliable ground-truth measure of the correct reasons for relevance and the INEX collection is representative for broader claims about LLM assessors.
What would settle it
Build parallel test collections for the same queries using only LLM assessments versus only human assessments, then run the same set of retrieval systems on both and check whether the system rankings and performance scores differ substantially.
Figures
read the original abstract
A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages in documents that it regards as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these judges are right for the right reasons. Our observations lead us to reiterate the cautionary note sounded in some earlier studies when it comes to using LLMs as assessors for creating IR datasets: while LLMs are unquestionably promising, and may be used judiciously to subtantially reduce the amount of human involvement required to generate high-quality benchmark datasets, they cannot replace humans as assessors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates LLMs as relevance assessors for ad hoc IR tasks on the INEX Wikipedia collection. Human assessors and LLMs are both asked to judge document relevance and highlight relevant passages within documents. The work compares document-level agreement and passage overlap between LLMs and humans, concluding that LLMs are promising for reducing human effort in benchmark creation but cannot replace humans because they frequently fail to highlight the same passages (i.e., are not 'right for the right reason').
Significance. If the central comparison holds after addressing ground-truth reliability, the paper offers a useful cautionary result for the growing literature on LLM judges in IR. It strengthens the case for hybrid human-LLM pipelines by quantifying passage-level mismatches on a standard test collection, while crediting LLMs' potential to scale assessment with reduced human involvement.
major comments (2)
- [Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.
- [Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.
minor comments (2)
- [Abstract] Typo in abstract: 'subtantially' should read 'substantially'.
- [Experimental Setup] Provide more detail on the exact prompting templates, temperature settings, and specific LLM versions used to support reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the work while remaining faithful to the available data and study design.
read point-by-point responses
-
Referee: [Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.
Authors: We agree that inter-annotator agreement (IAA) for passage highlights would provide important context for interpreting mismatches. The INEX Wikipedia collection (2006-2009) does not report IAA statistics for these annotations, and the archived data do not allow recomputation of overlap among assessors. In the revised manuscript we will add an explicit discussion of this limitation in the methods and limitations sections, qualify our claims by noting that human variability may contribute to observed differences, and reference related literature on IAA in ad hoc relevance assessment to contextualize the results. This is a partial revision because the missing IAA data cannot be supplied. revision: partial
-
Referee: [Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.
Authors: The passage-overlap metrics were not pre-registered; the study was exploratory. In revision we will state this explicitly in the methods section, add random-passage and majority-vote human baselines, and report document-level agreement results against these baselines to allow better verification of the quantitative claims. revision: yes
- Absence of inter-annotator agreement statistics in the original INEX annotations, which cannot be retroactively computed from the archived data.
Circularity Check
No significant circularity; empirical comparison to external INEX annotations
full rationale
The paper's central evaluation compares LLM-generated relevance judgments and passage highlights directly against the pre-existing human annotations from the INEX Wikipedia collection. No parameters are fitted to the target data, no self-definitional loops are present, and no load-bearing claims reduce to self-citations or ansatzes. The derivation chain consists of straightforward empirical measurement against an independent external benchmark, satisfying the criteria for a self-contained, non-circular analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human passage highlights in the INEX collection constitute valid ground truth for evidence selection
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages... compute recall and precision... overlap between the model’s outputs and the ground truth annotations
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LLMs are promising... but they cannot replace humans as assessors
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi
- [2]
-
[3]
Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs can be Fooled into Labelling a Document as Relevant: best café near me; this paper is perfectly relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Tokyo, Japan) (SIGIR-AP 2024...
- [4]
-
[5]
Avishek Anand, Procheta Sen, Sourav Saha, Manisha Verma, and Mandar Mi- tra. 2023. Explainable Information Retrieval. In Proceedings of the 46th Interna- tional ACM SIGIR Conference on Research and Development in Information Re- trieval (Taipei, Taiwan)(SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3448–3451. doi:10.1145/3539618.3594249
-
[6]
Paavo Arvola, Shlomo Geva, Jaap Kamps, Ralf Schenkel, Andrew Trotman, and Johanna Vainio. 2011. Overview of the INEX 2010 Ad Hoc Track. InComparative Evaluation of Focused Retrieval , Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–32
work page 2011
-
[7]
Christine Bauer, Ben Carterette, Nicola Ferro, Norbert Fuhr, Joeran Beel, Timo Breuer, Charles L. A. Clarke, Anita Crescenzi, Gianluca Demartini, Gior- gio Maria Di Nunzio, Laura Dietz, Guglielmo Faggioli, Bruce Ferwerda, Maik Fröbe, Matthias Hagen, Allan Hanbury, Claudia Hauff, Dietmar Jannach, Noriko Kando, Evangelos Kanoulas, Bart P. Knijnenburg, Udo K...
-
[8]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka- plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...
work page 2020
-
[9]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- bastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James B...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Charles L.A. Clarke and Laura Dietz. 2025. LLM-based relevance assessment still can’t replace human relevance assessment . Technical Report 2412.17156. arXiv
-
[11]
Rice University David M. Lane. [n. d.]. Online Statistics Education: A Multimedia Course of Study. http://onlinestatbook.com/. Chapter 2
-
[13]
Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Pot- thast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Lan- guage Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retri...
- [14]
-
[15]
Shlomo Geva, Jaap Kamps, Miro Lethonen, Ralf Schenkel, James A. Thom, and Andrew Trotman. 2010. Overview of the INEX 2009 Ad Hoc Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 4–25
work page 2010
-
[16]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL] https://arxiv.org/abs/2205.11916
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics , Anna Ko- rhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 6086–6096. doi:10.18...
-
[18]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented genera- tion for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems...
work page 2020
-
[19]
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp
-
[20]
In: Muresan, S., Nakov, P., Villavicencio, A
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Com- putational Linguistics, Dublin, Ireland, 8086–8098. doi...
-
[21]
Sean MacAvaney and Luca Soldaini. 2023. One-Shot Labeling for Automatic Relevance Estimation. In Proceedings of the 46th International ACM SIGIR Con- ference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23) . Association for Computing Machinery, New York, NY, USA, 2230–
work page 2023
-
[22]
doi:10.1145/3539618.3592032
-
[23]
Kiran Purohit, Venktesh V, Sourangshu Bhattacharya, and Avishek Anand. 2025. Sample Efficient Demonstration Selection for In-Context Learning. In Forty- second International Conference on Machine Learning . https://openreview.net/ forum?id=cuqvlLBQK6
work page 2025
-
[24]
Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos
Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos. 2024. Synthetic Test Collections for Retrieval Evaluation. In Proceed- ings of the 47th International ACM SIGIR Conference on Research and Develop- ment in Information Retrieval (Washington DC, USA) (SIGIR ’24) . Association for Computing Machinery, New York, NY, USA, 2647–...
-
[25]
Rahmani, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra
Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra. 2025. JudgeBlender: Ensembling Automatic Relevance Judgments. InCompanion Pro- ceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia) (WWW ’25). Association for Computing Machinery, New York, NY, USA, 1268–1272. doi:10.1145/3701716.3715536
-
[26]
John W. Ratcliff and David E. Metzener. 1988. Pattern Matching: The Gestalt Approach. Dr. Dobb’s Journal (July 1988)
work page 1988
-
[27]
J. Rubia. 2024. Rice University Rule to Determine the Number of Bins. Open Journal of Statistics 14 (2024), 119–149. doi: 10.4236/ojs.2024.141006
-
[28]
Ian Soboroff. 2025. Don’t Use LLMs to Make Relevance Judgments.Information Retrieval Research 1, 1 (Mar. 2025), 29–46. doi:10.54195/irrj.19625
-
[29]
Voorhees, Tetsuya Sakai, and Ian Soboroff
Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai, and Ian Soboroff. 2025. LLM- Assisted Relevance Assessments: When Should We Ask LLMs for Help?. InPro- ceedings of the 48th International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Com- puting Machinery, New York, NY, USA, 95–105. ...
-
[30]
Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Lan- guage Models can Accurately Predict Searcher Preferences. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Infor- mation Retrieval (Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1930–1940. doi:...
-
[31]
Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. 2025. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (Padua, Italy) (ICTIR ’25)...
-
[32]
Ellen M. Voorhees and Donna K. Harman. 2005. TREC: Experiment and Evalu- ation in Information Retrieval (Digital Libraries and Electronic Publishing) . The MIT Press
work page 2005
-
[33]
Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, and Ra- makanth Pasunuru. 2023. Complementary Explanations for Effective In-Context Fine Grained Evaluation of LLMs-as-Judges Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Learning. InFindings of the Association for Computational Linguistics: ACL 2023 , Anna Rogers, Jordan...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.