
arXiv: 2601.08919 · v2 · submitted 2026-01-13 · cs.IR · cs.CL · cs.LG

LLMs as Assessors: Right for the Right Reason?

Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3

classification: cs.IR · cs.CL · cs.LG
keywords: LLM assessors · relevance judgment · passage highlighting · information retrieval · benchmark datasets · INEX collection · human vs LLM comparison · test collection creation

The pith

LLMs match human relevance judgments on documents but cite different passages as evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether large language models can serve as substitutes for human assessors when building relevance judgments for information retrieval benchmarks. The authors extend prior work by using the INEX Wikipedia collection and asking both LLMs and humans not only to label documents as relevant or non-relevant but also to highlight the specific passages that justify the label. Comparisons show reasonable agreement at the document level yet clear divergence at the passage level, indicating that LLMs reach similar verdicts for different underlying reasons. This distinction matters because benchmark datasets built solely from LLM outputs could produce inconsistent or biased evaluations of retrieval systems. The authors conclude that LLMs can reduce but not eliminate the need for human involvement in creating high-quality test collections.

Core claim

The central claim is that LLMs prompted to judge document relevance and highlight useful passages on the INEX Wikipedia collection agree with human assessors at the document level but disagree substantially on the specific passages selected as evidence. This shows LLMs are often right about relevance without being right for the same reasons as humans. Consequently, while LLMs can be used judiciously to reduce the amount of human labor required for benchmark datasets, they cannot replace human assessors.

What carries the argument

The passage-highlighting task performed by both LLMs and humans on the same INEX Wikipedia queries and documents, allowing comparison of relevance labels together with the cited evidence for those labels.

Load-bearing premise

Human passage highlights provide a reliable ground-truth measure of the correct reasons for relevance, and the INEX collection is representative enough to support broader claims about LLM assessors.

What would settle it

Build parallel test collections for the same queries using only LLM assessments versus only human assessments, then run the same set of retrieval systems on both and check whether the system rankings and performance scores differ substantially.
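
One common way to check whether two sets of relevance judgments rank systems differently is a rank correlation such as Kendall's tau. The sketch below is illustrative only: the system names and scores are hypothetical, the function `kendall_tau` is written here for the example (tau-a, with no tie correction), and the paper is not known to use this exact procedure.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts (no tie correction)."""
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems in common")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both judgment sets order s and t the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for four systems, evaluated once with
# human-built qrels and once with LLM-built qrels.
human_qrels_scores = {"bm25": 0.21, "dense": 0.27, "hybrid": 0.30, "rerank": 0.33}
llm_qrels_scores = {"bm25": 0.25, "dense": 0.24, "hybrid": 0.31, "rerank": 0.36}

tau = kendall_tau(human_qrels_scores, llm_qrels_scores)
print(f"Kendall tau between the two system rankings: {tau:.3f}")
```

A tau close to 1 would suggest the two collections rank systems alike; here a single swapped pair out of six already lowers the value noticeably.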

Figures

Figures reproduced from arXiv: 2601.08919 by Aditya Dutta, Mandar Mitra, Sourav Saha.

Figure 1: Structure of our prompt. For Example 7, which cor…
Figure 2: A line from an INEX 2009 qrel file.
… general purpose LLM to be able to learn this format from a few in-context examples; it generates only plain text as ŷ. We then need to map each ŷ to one or more substrings of the corresponding document. For this, we employ the pattern-matching algorithm proposed in [23]. This algorithm identifies the longest common subsequences (at the word level) between the documen…
Figure 3: Distribution of generated content lengths relative to document lengths across INEX 2009 and 2010 using Exemplar…
Figure 4: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…
Figure 5: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…
Figure 6: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…
Figure 7: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…
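
The text captured alongside Figure 2 describes mapping the LLM's plain-text output back to substrings of the document via word-level matching. A minimal sketch of that idea follows, using Python's standard `difflib.SequenceMatcher`, which implements a matching strategy in the same family as Ratcliff and Obershelp's gestalt pattern matching. This is an assumption-laden illustration, not the authors' implementation: the function name `match_spans`, the whitespace tokenization, and the `min_words` threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def match_spans(document, generated, min_words=3):
    """Return (start, end) word-index spans of `document` that also occur in `generated`."""
    doc_words = document.split()
    gen_words = generated.split()
    # autojunk=False disables difflib's popularity heuristic for long sequences.
    matcher = SequenceMatcher(None, doc_words, gen_words, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # Ignore very short matches (and the zero-length sentinel block).
        if block.size >= min_words:
            spans.append((block.a, block.a + block.size))
    return spans


doc = ("The INEX test collection is built from Wikipedia articles and human "
       "assessors highlighted relevant passages in every article")
gen = "built from Wikipedia articles where assessors highlighted relevant passages"

spans = match_spans(doc, gen)
highlighted = [" ".join(doc.split()[start:end]) for start, end in spans]
print(spans)
print(highlighted)
```

The threshold on match length is a design choice: without it, isolated common words would be reported as highlighted passages.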
Original abstract

A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages in documents that it regards as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these judges are right for the right reasons. Our observations lead us to reiterate the cautionary note sounded in some earlier studies when it comes to using LLMs as assessors for creating IR datasets: while LLMs are unquestionably promising, and may be used judiciously to subtantially reduce the amount of human involvement required to generate high-quality benchmark datasets, they cannot replace humans as assessors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates LLMs as relevance assessors for ad hoc IR tasks on the INEX Wikipedia collection. Human assessors and LLMs are both asked to judge document relevance and highlight relevant passages within documents. The work compares document-level agreement and passage overlap between LLMs and humans, concluding that LLMs are promising for reducing human effort in benchmark creation but cannot replace humans because they frequently fail to highlight the same passages (i.e., are not 'right for the right reason').
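
To make the two levels of comparison concrete, the sketch below computes a document-level agreement statistic (Cohen's kappa on binary labels) and a passage-level overlap (character-level precision and recall of highlighted spans). The specific metrics, function names, and toy data are assumptions chosen for illustration; the paper's exact measures may differ.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # fraction labelled relevant by assessor A
    p_b = sum(labels_b) / n  # fraction labelled relevant by assessor B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both assessors used a single, identical label throughout
    return (observed - expected) / (1 - expected)


def span_overlap(gold_spans, pred_spans):
    """Character-level (precision, recall) of predicted highlights against gold highlights."""
    gold = {i for start, end in gold_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: document-level labels for eight documents (1 = relevant).
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
llm_labels = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohen_kappa(human_labels, llm_labels)

# Toy data: highlighted character ranges [start, end) within one relevant document.
precision, recall = span_overlap(gold_spans=[(0, 100)], pred_spans=[(80, 120)])
print(kappa, precision, recall)
```

In this toy case the two assessors agree on most document labels while the predicted highlight covers only a fifth of the gold passage, which is the kind of gap the report summarizes.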

Significance. If the central comparison holds after addressing ground-truth reliability, the paper offers a useful cautionary result for the growing literature on LLM judges in IR. It strengthens the case for hybrid human-LLM pipelines by quantifying passage-level mismatches on a standard test collection, while crediting LLMs' potential to scale assessment with reduced human involvement.

major comments (2)
  1. [Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.
  2. [Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.
minor comments (2)
  1. [Abstract] Typo in abstract: 'subtantially' should read 'substantially'.
  2. [Experimental Setup] Provide more detail on the exact prompting templates, temperature settings, and specific LLM versions used to support reproducibility.
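
The inter-annotator agreement requested in major comment 1 could, for example, be estimated as the mean pairwise Jaccard overlap of highlighted characters when several assessors have marked the same document. The sketch below assumes such multi-assessor highlights exist, which the report itself questions for INEX; the function names and data are hypothetical.

```python
from itertools import combinations


def jaccard(spans_a, spans_b):
    """Jaccard overlap of two sets of highlighted character ranges [start, end)."""
    a = {i for start, end in spans_a for i in range(start, end)}
    b = {i for start, end in spans_b for i in range(start, end)}
    union = a | b
    # Two empty highlight sets are treated as perfect agreement.
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(annotations):
    """Average Jaccard overlap over all pairs of annotators for one document."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical highlights from three assessors on the same document.
annotators = [
    [(0, 100)],
    [(50, 150)],
    [(0, 50)],
]
iaa = mean_pairwise_jaccard(annotators)
print(f"mean pairwise Jaccard: {iaa:.3f}")
```

A human-human figure of this kind would give a ceiling against which LLM-human overlap could be read.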

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the work while remaining faithful to the available data and study design.

Point-by-point responses
  1. Referee: [Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.

    Authors: We agree that inter-annotator agreement (IAA) for passage highlights would provide important context for interpreting mismatches. The INEX Wikipedia collection (2006-2009) does not report IAA statistics for these annotations, and the archived data do not allow recomputation of overlap among assessors. In the revised manuscript we will add an explicit discussion of this limitation in the methods and limitations sections, qualify our claims by noting that human variability may contribute to observed differences, and reference related literature on IAA in ad hoc relevance assessment to contextualize the results. This is a partial revision because the missing IAA data cannot be supplied. revision: partial

  2. Referee: [Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.

    Authors: The passage-overlap metrics were not pre-registered; the study was exploratory. In revision we will state this explicitly in the methods section, add random-passage and majority-vote human baselines, and report document-level agreement results against these baselines to allow better verification of the quantitative claims. revision: yes
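
A random-passage baseline of the kind promised in the second response might be estimated as follows: draw a contiguous span of the same length as the LLM's highlight uniformly at random and measure its precision against the human highlights. This is a sketch under stated assumptions; the function name, the Monte Carlo procedure, and the character-offset representation are illustrative and not taken from the paper.

```python
import random


def random_span_precision(doc_len, gold_spans, span_len, trials=2000, seed=0):
    """Monte Carlo estimate of the precision of a random contiguous highlight."""
    if span_len < 1 or doc_len < 1:
        raise ValueError("doc_len and span_len must be positive")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    gold = {i for start, end in gold_spans for i in range(start, end)}
    span_len = min(span_len, doc_len)
    total = 0.0
    for _ in range(trials):
        start = rng.randint(0, doc_len - span_len)  # inclusive bounds
        hits = sum(1 for i in range(start, start + span_len) if i in gold)
        total += hits / span_len
    return total / trials


# Hypothetical document of 1000 characters whose first half is highlighted by humans.
baseline = random_span_precision(doc_len=1000, gold_spans=[(0, 500)], span_len=100)
print(f"expected precision of a random 100-character highlight: {baseline:.3f}")
```

An LLM highlight is informative only to the extent that its precision exceeds this chance level for documents of comparable length and gold coverage.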

standing simulated objections not resolved
  • Absence of inter-annotator agreement statistics in the original INEX annotations, which cannot be retroactively computed from the archived data.

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to external INEX annotations

Full rationale

The paper's central evaluation compares LLM-generated relevance judgments and passage highlights directly against the pre-existing human annotations from the INEX Wikipedia collection. No parameters are fitted to the target data, no self-definitional loops are present, and no load-bearing claims reduce to self-citations or ansatzes. The derivation chain consists of straightforward empirical measurement against an independent external benchmark, satisfying the criteria for a self-contained, non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical and relies on standard comparison of LLM outputs against an existing human-labeled collection; no new free parameters, axioms beyond basic statistical agreement, or invented entities are introduced.

axioms (1)
  • domain assumption: Human passage highlights in the INEX collection constitute valid ground truth for evidence selection
    Invoked when treating human highlights as the reference for measuring whether LLMs are 'right for the right reason'




Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi

  2. [2]

    Can We Use Large Language Models to Fill Relevance Judgment Holes? arXiv:2405.05600 [cs.IR] https://arxiv.org/abs/2405.05600

  3. [3]

    Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs can be Fooled into Labelling a Document as Relevant: best café near me; this paper is perfectly relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Tokyo, Japan) (SIGIR-AP 2024...

  4. [4]

    Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, and Zijian Zhang. 2022. Explainable Information Retrieval: A Survey. ArXiv (2022). https://arxiv.org/abs/2211.02405

  5. [5]

    Avishek Anand, Procheta Sen, Sourav Saha, Manisha Verma, and Mandar Mitra. 2023. Explainable Information Retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3448–3451. doi:10.1145/3539618.3594249

  6. [6]

    Paavo Arvola, Shlomo Geva, Jaap Kamps, Ralf Schenkel, Andrew Trotman, and Johanna Vainio. 2011. Overview of the INEX 2010 Ad Hoc Track. In Comparative Evaluation of Focused Retrieval, Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–32

  7. [7]

    Christine Bauer, Ben Carterette, Nicola Ferro, Norbert Fuhr, Joeran Beel, Timo Breuer, Charles L. A. Clarke, Anita Crescenzi, Gianluca Demartini, Giorgio Maria Di Nunzio, Laura Dietz, Guglielmo Faggioli, Bruce Ferwerda, Maik Fröbe, Matthias Hagen, Allan Hanbury, Claudia Hauff, Dietmar Jannach, Noriko Kando, Evangelos Kanoulas, Bart P. Knijnenburg, Udo K...

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  9. [9]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James B...

  10. [10]

    Charles L.A. Clarke and Laura Dietz. 2025. LLM-based relevance assessment still can’t replace human relevance assessment. Technical Report 2412.17156. arXiv

  11. [11]

    Rice University David M. Lane. [n. d.]. Online Statistics Education: A Multimedia Course of Study. http://onlinestatbook.com/. Chapter 2

  12. [13]

    Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retri...

  13. [14]

    Robert Friel, Masha Belyi, and Atindriyo Sanyal. 2025. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. arXiv:2407.11005 [cs.CL] https://arxiv.org/abs/2407.11005

  14. [15]

    Shlomo Geva, Jaap Kamps, Miro Lehtonen, Ralf Schenkel, James A. Thom, and Andrew Trotman. 2010. Overview of the INEX 2009 Ad Hoc Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 4–25

  15. [16]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL] https://arxiv.org/abs/2205.11916

  16. [17]

    Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 6086–6096. doi:10.18...

  17. [18]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems...

  18. [19]

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp

  19. [20]

    Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 8086–8098. doi...

  20. [21]

    Sean MacAvaney and Luca Soldaini. 2023. One-Shot Labeling for Automatic Relevance Estimation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2230–

  21. [22]

    doi:10.1145/3539618.3592032

  22. [23]

    Kiran Purohit, Venktesh V, Sourangshu Bhattacharya, and Avishek Anand. 2025. Sample Efficient Demonstration Selection for In-Context Learning. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=cuqvlLBQK6

  23. [24]

    Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos. 2024. Synthetic Test Collections for Retrieval Evaluation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2647–...

  24. [25]

    Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra. 2025. JudgeBlender: Ensembling Automatic Relevance Judgments. In Companion Proceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia) (WWW ’25). Association for Computing Machinery, New York, NY, USA, 1268–1272. doi:10.1145/3701716.3715536

  25. [26]

    John W. Ratcliff and David E. Metzener. 1988. Pattern Matching: The Gestalt Approach. Dr. Dobb’s Journal (July 1988)

  26. [27]

    J. Rubia. 2024. Rice University Rule to Determine the Number of Bins. Open Journal of Statistics 14 (2024), 119–149. doi: 10.4236/ojs.2024.141006

  27. [28]

    Ian Soboroff. 2025. Don’t Use LLMs to Make Relevance Judgments. Information Retrieval Research 1, 1 (Mar. 2025), 29–46. doi:10.54195/irrj.19625

  28. [29]

    Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai, and Ian Soboroff. 2025. LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 95–105. ...

  29. [30]

    Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1930–1940. doi:...

  30. [31]

    Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. 2025. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (Padua, Italy) (ICTIR ’25)...

  31. [32]

    Ellen M. Voorhees and Donna K. Harman. 2005. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing). The MIT Press

  32. [33]

    Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, and Ramakanth Pasunuru. 2023. Complementary Explanations for Effective In-Context Learning. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan...