LLMs as Assessors: Right for the Right Reason?

Aditya Dutta; Mandar Mitra; Sourav Saha

arXiv: 2601.08919 · v2 · submitted 2026-01-13 · 💻 cs.IR · cs.CL · cs.LG

Pith reviewed 2026-05-16 14:19 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG

keywords LLM assessors · relevance judgment · passage highlighting · information retrieval · benchmark datasets · INEX collection · human vs LLM comparison · test collection creation

The pith

LLMs match human relevance judgments on documents but cite different passages as evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether large language models can serve as substitutes for human assessors when building relevance judgments for information retrieval benchmarks. The authors extend prior work by using the INEX Wikipedia collection and asking both LLMs and humans not only to label documents as relevant or non-relevant but also to highlight the specific passages that justify the label. Comparisons show reasonable agreement at the document level yet clear divergence at the passage level, indicating that LLMs reach similar verdicts for different underlying reasons. This distinction matters because benchmark datasets built solely from LLM outputs could produce inconsistent or biased evaluations of retrieval systems. The authors conclude that LLMs can reduce but not eliminate the need for human involvement in creating high-quality test collections.

Core claim

The central claim is that LLMs prompted to judge document relevance and highlight useful passages on the INEX Wikipedia collection agree with human assessors at the document level but disagree substantially on the specific passages selected as evidence. This shows LLMs are often right about relevance without being right for the same reasons as humans. Consequently, while LLMs can be used judiciously to reduce the amount of human labor required for benchmark datasets, they cannot replace human assessors.
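
Document-level agreement of this kind is commonly summarised with a chance-corrected statistic. The sketch below computes Cohen's kappa for binary relevance labels. The choice of kappa and the toy label lists are assumptions made for illustration, since this page does not state which agreement measure the paper reports.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two assessors over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two assessors give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each assessor's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = relevant, 0 = non-relevant), not data from the paper.
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 1, 0, 1, 1, 0, 0, 0]
print(cohen_kappa(human, llm))
```

A high kappa on labels alone says nothing about whether the two assessors pointed to the same passages, which is the gap the passage-level comparison is meant to expose.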

What carries the argument

The passage-highlighting task performed by both LLMs and humans on the same INEX Wikipedia queries and documents, allowing comparison of relevance labels together with the cited evidence for those labels.
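
One way to score that comparison is character-level precision and recall of the LLM highlights against the human highlights. This is a minimal sketch assuming highlights are available as (start, end) character offsets; the function names and the toy spans are hypothetical, and the paper's exact overlap definition may differ.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def covered_positions(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    positions: Set[int] = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions


def passage_precision_recall(
    llm_spans: Iterable[Span], human_spans: Iterable[Span]
) -> Tuple[float, float]:
    """Precision and recall of LLM highlights, with human highlights as reference."""
    llm = covered_positions(llm_spans)
    human = covered_positions(human_spans)
    overlap = len(llm & human)
    precision = overlap / len(llm) if llm else 0.0
    recall = overlap / len(human) if human else 0.0
    return precision, recall


# Hypothetical document: both assessors call it relevant,
# but their highlighted evidence only partly coincides.
human_spans = [(0, 100)]
llm_spans = [(50, 150)]
print(passage_precision_recall(llm_spans, human_spans))
```

A document where both assessors say "relevant" but precision and recall are both zero would be a case of being right for a different reason.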

Load-bearing premise

Human passage highlights provide a reliable ground-truth measure of the correct reasons for relevance and the INEX collection is representative for broader claims about LLM assessors.

What would settle it

Build parallel test collections for the same queries using only LLM assessments versus only human assessments, then run the same set of retrieval systems on both and check whether the system rankings and performance scores differ substantially.
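
The ranking comparison in that test can be sketched as a rank correlation between two leaderboards. The code below computes Kendall's tau (the tau-a variant, which ignores ties) over per-system scores; the system names and scores are made up for illustration, and the page does not specify what threshold would count as a substantial difference.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables over their shared systems."""
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both tables order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for three systems under each set of judgments.
human_qrels_scores = {"bm25": 0.30, "dense": 0.42, "hybrid": 0.45}
llm_qrels_scores = {"bm25": 0.35, "dense": 0.33, "hybrid": 0.50}
print(kendall_tau(human_qrels_scores, llm_qrels_scores))
```

A tau near 1 would mean the two collections rank systems alike; lower values would indicate that swapping in LLM judgments changes which systems look best.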

Figures

Figures reproduced from arXiv: 2601.08919 by Aditya Dutta, Mandar Mitra, Sourav Saha.

**Figure 2.** Figure 2: A line from an INEX 2009 qrel file. … general purpose LLM to be able to learn this format from a few in-context examples; it generates only plain text as ŷ. We then need to map each ŷ to one or more substrings of the corresponding document. For this, we employ the pattern-matching algorithm proposed in [23]. This algorithm identifies the longest common subsequences (at the word level) between the documen…

**Figure 3.** Figure 3: Distribution of generated content lengths relative to document lengths across INEX 2009 and 2010 using Exemplar

**Figure 4.** Figure 4: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…

**Figure 5.** Figure 5: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…

**Figure 6.** Figure 6: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…

**Figure 7.** Figure 7: Distribution of document length vs precision, document length vs fraction of gold chunks, document length vs…
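
The text captured with the Figure 2 caption describes mapping each plain-text generation back to substrings of the document via word-level matching. As a rough illustration only, the sketch below uses Python's `difflib.SequenceMatcher` over word lists. It is a stand-in that finds contiguous matching word runs, not necessarily the algorithm the authors cite, and `map_to_document` and `min_words` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import List


def map_to_document(generated: str, document: str, min_words: int = 3) -> List[str]:
    """Return document passages whose word sequences also occur in the generation."""
    doc_words = document.split()
    gen_words = generated.split()
    # autojunk=False keeps frequent words from being discarded on long inputs.
    matcher = SequenceMatcher(a=doc_words, b=gen_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel; short runs are skipped
        # (an assumed heuristic) to avoid matching isolated common words.
        if block.size >= min_words:
            passages.append(" ".join(doc_words[block.a : block.a + block.size]))
    return passages


document = (
    "The Eiffel Tower is in Paris. "
    "It was completed in 1889 and remains a landmark."
)
generated = "The Eiffel Tower stands in Paris. It was completed in 1889"
print(map_to_document(generated, document))
```

Because the generation is free text, a changed word ("stands" for "is") splits the match into two passages, which is why some minimum run length is needed before counting a match as a highlight.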

Original abstract

A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages in documents that it regards as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these judges are right for the right reasons. Our observations lead us to reiterate the cautionary note sounded in some earlier studies when it comes to using LLMs as assessors for creating IR datasets: while LLMs are unquestionably promising, and may be used judiciously to subtantially reduce the amount of human involvement required to generate high-quality benchmark datasets, they cannot replace humans as assessors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LLMs match humans on document relevance judgments in this INEX setup but diverge on passage highlights, with the human ground truth left unvalidated for consistency.

The main point is that LLMs can reach decent agreement with humans on whether a document is relevant or not using the INEX Wikipedia collection, but they pick out different passages when asked to highlight the supporting evidence. This leads the authors to the familiar caution that LLMs can reduce labeling effort but cannot stand in for humans entirely on dataset creation.

The work extends earlier LLM-as-judge papers by adding the passage-level comparison on this specific collection, where humans were already instructed to mark relevant spans. That dual evaluation is the concrete addition, and the abstract keeps the protocol straightforward without extra claims. The results are presented as measured rather than dramatic, which fits the data.

The soft spot is the missing check on how consistent the original human assessors were with each other on passage selection. Highlighting is subjective, so without inter-annotator agreement numbers or overlap stats from the INEX creation process, the mismatches with LLMs could partly reflect normal human variation instead of clear LLM shortcomings. The abstract does not report any such figures, which leaves the 'right for the right reason' measure less anchored than it needs to be.

This paper is for IR groups that build or maintain test collections and are weighing automated options. Researchers already running LLM judges would get a useful data point on where the limits show up. I would send it for peer review. The core comparison uses an established collection and stays within its scope, so referees can usefully push on the agreement details and any statistical presentation.

Referee Report

2 major / 2 minor

Summary. The paper investigates LLMs as relevance assessors for ad hoc IR tasks on the INEX Wikipedia collection. Human assessors and LLMs are both asked to judge document relevance and highlight relevant passages within documents. The work compares document-level agreement and passage overlap between LLMs and humans, concluding that LLMs are promising for reducing human effort in benchmark creation but cannot replace humans because they frequently fail to highlight the same passages (i.e., are not 'right for the right reason').

Significance. If the central comparison holds after addressing ground-truth reliability, the paper offers a useful cautionary result for the growing literature on LLM judges in IR. It strengthens the case for hybrid human-LLM pipelines by quantifying passage-level mismatches on a standard test collection, while crediting LLMs' potential to scale assessment with reduced human involvement.

major comments (2)

[Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.
[Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.
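
The overlap statistic requested in the first major comment could be as simple as mean pairwise Jaccard overlap between assessors' highlights on the same document. The sketch below assumes that highlights from several assessors exist as character-offset spans, which the rebuttal on this page says is not the case for the archived INEX data; the assessor names and spans are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def positions(spans: Iterable[Span]) -> Set[int]:
    """Set of character positions covered by the given spans."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def jaccard(spans_a: Iterable[Span], spans_b: Iterable[Span]) -> float:
    """Intersection over union of the highlighted characters."""
    a, b = positions(spans_a), positions(spans_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(highlights: Dict[str, List[Span]]) -> float:
    """Average Jaccard overlap over all pairs of assessors for one document."""
    pairs = list(combinations(sorted(highlights), 2))
    if not pairs:
        raise ValueError("need at least two assessors")
    return sum(jaccard(highlights[x], highlights[y]) for x, y in pairs) / len(pairs)


# Hypothetical highlights from three assessors on one document.
highlights = {
    "assessor_a": [(0, 100)],
    "assessor_b": [(50, 150)],
    "assessor_c": [(0, 100)],
}
print(mean_pairwise_jaccard(highlights))
```

A human-human figure of this kind would give a ceiling against which LLM-human overlap could be read.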

minor comments (2)

[Abstract] Typo in abstract: 'subtantially' should read 'substantially'.
[Experimental Setup] Provide more detail on the exact prompting templates, temperature settings, and specific LLM versions used to support reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the work while remaining faithful to the available data and study design.

Referee: [Methods (human assessment protocol)] The evaluation treats human passage highlights as stable ground truth for measuring whether LLMs are 'right for the right reason,' yet no inter-annotator agreement statistics or overlap metrics among the original INEX human assessors are reported. Passage selection is subjective; without IAA, observed LLM-human mismatches cannot be confidently attributed to LLM reasoning rather than annotator variability. This directly undermines the load-bearing claim in the abstract and conclusion that LLMs cannot replace humans.

Authors: We agree that inter-annotator agreement (IAA) for passage highlights would provide important context for interpreting mismatches. The INEX Wikipedia collection (INEX 2009 and 2010) does not report IAA statistics for these annotations, and the archived data do not allow recomputation of overlap among assessors. In the revised manuscript we will add an explicit discussion of this limitation in the methods and limitations sections, qualify our claims by noting that human variability may contribute to observed differences, and reference related literature on IAA in ad hoc relevance assessment to contextualize the results. This is a partial revision because the missing IAA data cannot be supplied.
revision: partial

Referee: [Results and Evaluation] The abstract describes a comparison protocol using passage-overlap metrics, but the manuscript does not indicate whether this metric was pre-registered or include document-level agreement results with appropriate baselines (e.g., random or majority-vote human highlights). This weakens verification of the quantitative claims about LLM performance.

Authors: The passage-overlap metrics were not pre-registered; the study was exploratory. In revision we will state this explicitly in the methods section, add random-passage and majority-vote human baselines, and report document-level agreement results against these baselines to allow better verification of the quantitative claims.
revision: yes
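
A random-passage baseline could take several forms. As one minimal sketch, the code below computes the expected precision of a single contiguous highlight of fixed length placed uniformly at random in the document, by enumerating every start position. The function name, the single-span assumption, and the example numbers are all hypothetical and not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def random_span_baseline_precision(
    doc_length: int, highlight_length: int, human_spans: Iterable[Span]
) -> float:
    """Expected precision of one randomly placed contiguous highlight."""
    if not 0 < highlight_length <= doc_length:
        raise ValueError("highlight_length must be in 1..doc_length")
    human: Set[int] = set()
    for start, end in human_spans:
        human.update(range(start, end))
    n_starts = doc_length - highlight_length + 1
    total = 0.0
    for start in range(n_starts):
        # Characters of this candidate highlight that humans also marked.
        overlap = sum(1 for p in range(start, start + highlight_length) if p in human)
        total += overlap / highlight_length
    return total / n_starts


# Hypothetical case: humans marked the first half of a 1000-character document.
print(random_span_baseline_precision(1000, 100, [(0, 500)]))
```

An LLM's passage precision is informative only to the extent that it exceeds a figure like this for documents of comparable length and highlight density.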

standing simulated objections not resolved

Absence of inter-annotator agreement statistics in the original INEX annotations, which cannot be retroactively computed from the archived data.

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to external INEX annotations

The paper's central evaluation compares LLM-generated relevance judgments and passage highlights directly against the pre-existing human annotations from the INEX Wikipedia collection. No parameters are fitted to the target data, no self-definitional loops are present, and no load-bearing claims reduce to self-citations or ansatzes. The derivation chain consists of straightforward empirical measurement against an independent external benchmark, satisfying the criteria for a self-contained, non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical and relies on standard comparison of LLM outputs against an existing human-labeled collection; no new free parameters, axioms beyond basic statistical agreement, or invented entities are introduced.

axioms (1)

domain assumption: Human passage highlights in the INEX collection constitute valid ground truth for evidence selection.
Invoked when treating human highlights as the reference for measuring whether LLMs are 'right for the right reason'.

pith-pipeline@v0.9.0 · 5553 in / 1216 out tokens · 54589 ms · 2026-05-16T14:19:13.693730+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
"prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages... compute recall and precision... overlap between the model’s outputs and the ground truth annotations"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"LLMs are promising... but they cannot replace humans as assessors"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

[1] Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi
[2] Can We Use Large Language Models to Fill Relevance Judgment Holes? arXiv:2405.05600 [cs.IR] https://arxiv.org/abs/2405.05600
[3] Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs can be Fooled into Labelling a Document as Relevant: best café near me; this paper is perfectly relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Tokyo, Japan) (SIGIR-AP 2024... doi:10.1145/3673791.3698431
[4] Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, and Zijian Zhang. 2022. Explainable Information Retrieval: A Survey. ArXiv (2022). https://arxiv.org/abs/2211.02405
[5] Avishek Anand, Procheta Sen, Sourav Saha, Manisha Verma, and Mandar Mitra. 2023. Explainable Information Retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3448–3451. doi:10.1145/3539618.3594249
[6] Paavo Arvola, Shlomo Geva, Jaap Kamps, Ralf Schenkel, Andrew Trotman, and Johanna Vainio. 2011. Overview of the INEX 2010 Ad Hoc Track. In Comparative Evaluation of Focused Retrieval, Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–32
[7] Christine Bauer, Ben Carterette, Nicola Ferro, Norbert Fuhr, Joeran Beel, Timo Breuer, Charles L. A. Clarke, Anita Crescenzi, Gianluca Demartini, Giorgio Maria Di Nunzio, Laura Dietz, Guglielmo Faggioli, Bruce Ferwerda, Maik Fröbe, Matthias Hagen, Allan Hanbury, Claudia Hauff, Dietmar Jannach, Noriko Kando, Evangelos Kanoulas, Bart P. Knijnenburg, Udo K... doi:10.1145/3636341.3636351
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James B...
[10] Charles L.A. Clarke and Laura Dietz. 2025. LLM-based relevance assessment still can’t replace human relevance assessment. Technical Report 2412.17156. arXiv
[11] Rice University David M. Lane. [n. d.]. Online Statistics Education: A Multimedia Course of Study. http://onlinestatbook.com/. Chapter 2
[13] Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retri... doi:10.1145/3578337.3605136
[14] Robert Friel, Masha Belyi, and Atindriyo Sanyal. 2025. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. arXiv:2407.11005 [cs.CL] https://arxiv.org/abs/2407.11005
[15] Shlomo Geva, Jaap Kamps, Miro Lehtonen, Ralf Schenkel, James A. Thom, and Andrew Trotman. 2010. Overview of the INEX 2009 Ad Hoc Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 4–25
[16] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL] https://arxiv.org/abs/2205.11916
[17] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 6086–6096. doi:10.18653/v1/p19-1612
[18] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems...
[19] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp
[20] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 8086–8098. doi...
[21] Sean MacAvaney and Luca Soldaini. 2023. One-Shot Labeling for Automatic Relevance Estimation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2230–
[22] doi:10.1145/3539618.3592032
[23] Kiran Purohit, Venktesh V, Sourangshu Bhattacharya, and Avishek Anand. 2025. Sample Efficient Demonstration Selection for In-Context Learning. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=cuqvlLBQK6
[24] Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos. 2024. Synthetic Test Collections for Retrieval Evaluation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2647–...
[25] Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra. 2025. JudgeBlender: Ensembling Automatic Relevance Judgments. In Companion Proceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia) (WWW ’25). Association for Computing Machinery, New York, NY, USA, 1268–1272. doi:10.1145/3701716.3715536
[26] John W. Ratcliff and David E. Metzener. 1988. Pattern Matching: The Gestalt Approach. Dr. Dobb’s Journal (July 1988)
[27] J. Rubia. 2024. Rice University Rule to Determine the Number of Bins. Open Journal of Statistics 14 (2024), 119–149. doi:10.4236/ojs.2024.141006
[28] Ian Soboroff. 2025. Don’t Use LLMs to Make Relevance Judgments. Information Retrieval Research 1, 1 (Mar. 2025), 29–46. doi:10.54195/irrj.19625
[29] Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai, and Ian Soboroff. 2025. LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 95–105. ... doi:10.1145/3726302.3729916
[30] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1930–1940. doi:10.1145/3626772.3657707
[31] Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. 2025. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (Padua, Italy) (ICTIR ’25)... doi:10.1145/3731120.3744605
[32] Ellen M. Voorhees and Donna K. Harman. 2005. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing). The MIT Press
[33] Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, and Ramakanth Pasunuru. 2023. Complementary Explanations for Effective In-Context Learning. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan...
