Recognition: no theorem link
Hybrid Pooling with LLMs via Relevance Context Learning
Pith reviewed 2026-05-16 06:03 UTC · model grok-4.3
The pith
Relevance Context Learning turns a few human judgements into explicit topic narratives that guide LLMs to produce more accurate relevance labels than standard prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Relevance Context Learning works by prompting an Instructor LLM to analyse sets of judged query-document pairs and generate explicit narratives describing what constitutes relevance for a given topic; these narratives then serve as structured prompts for an Assessor LLM to label unseen pairs. In a hybrid pooling strategy, human assessors judge a shallow depth-k pool while the remaining documents are labelled by the RCL-guided LLM, yielding relevance judgements that substantially outperform zero-shot prompting and improve over standard in-context learning.
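The two-stage process described above can be sketched in code. This is a minimal illustration, not the paper's actual templates: `call_llm` is a stub standing in for any chat-completion API, and the prompt wording, function names, and example data are all assumptions made for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Stub LLM: replace with a real chat-completion call."""
    if "Summarise what makes a document relevant" in prompt:
        return "A document is relevant if it discusses treatment options."
    return "relevant"

def build_narrative(topic: str, judged_pairs: list[tuple[str, int]]) -> str:
    """Stage 1: the Instructor LLM distils judged examples into a relevance narrative."""
    examples = "\n".join(
        f"Document: {doc}\nLabel: {'relevant' if label else 'non-relevant'}"
        for doc, label in judged_pairs
    )
    prompt = (
        f"Topic: {topic}\n{examples}\n"
        "Summarise what makes a document relevant to this topic."
    )
    return call_llm(prompt)

def assess(topic: str, narrative: str, document: str) -> int:
    """Stage 2: the Assessor LLM labels an unseen document, guided by the narrative."""
    prompt = (
        f"Topic: {topic}\nRelevance criteria: {narrative}\n"
        f"Document: {document}\nAnswer 'relevant' or 'non-relevant'."
    )
    return int(call_llm(prompt).strip().startswith("relevant"))

# Hypothetical topic and judged pairs, purely for illustration.
narrative = build_narrative(
    "diabetes treatment",
    [("Insulin therapy guidelines.", 1), ("History of the syringe.", 0)],
)
label = assess("diabetes treatment", narrative, "New oral drugs for type 2 diabetes.")
```

The key design point the sketch captures is that the human judgements are consumed once, in Stage 1, and reach the Assessor only as a distilled narrative rather than as raw in-context shots.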
What carries the argument
Relevance Context Learning (RCL), a two-stage process in which an Instructor LLM distills human judgements into topic-specific relevance narratives that then guide an Assessor LLM.
Load-bearing premise
LLMs can reliably extract and articulate topic-specific relevance criteria from a small number of judged examples in a form that generalizes accurately to unseen query-document pairs without introducing systematic bias.
What would settle it
On a held-out collection of human-judged query-document pairs, if relevance labels produced by RCL show lower agreement with the human gold labels than labels produced by standard in-context learning, the performance advantage would be falsified.
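The falsification test above reduces to comparing chance-corrected agreement with the gold labels. A minimal sketch, with made-up label vectors standing in for real RCL and ICL outputs:

```python
from collections import Counter

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[c] * pb[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

gold = [1, 0, 1, 1, 0, 0, 1, 0]   # held-out human judgements
rcl  = [1, 0, 1, 1, 0, 1, 1, 0]   # hypothetical RCL labels
icl  = [1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical ICL labels

# The central claim survives only if RCL's agreement with gold
# is at least as high as standard ICL's.
claim_holds = cohen_kappa(gold, rcl) >= cohen_kappa(gold, icl)
```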

Original abstract
High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or in-context learning (ICL) with a small number of labelled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalise to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labelled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyse sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labelled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Relevance Context Learning (RCL), a two-stage LLM framework for relevance judgment in IR: an Instructor LLM first analyzes small sets of human-judged query-document pairs to generate explicit topic-specific relevance narratives, which are then injected as structured context into prompts for an Assessor LLM to label unseen pairs. The authors also propose a hybrid pooling strategy that combines shallow human judgments on system pools with LLM labels for the remainder. They claim experimental results show RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL.
Significance. If the central claim holds after proper validation, RCL could meaningfully improve the cost-effectiveness and reliability of constructing large-scale relevance judgment datasets by converting limited human annotations into reusable, topic-aware narratives rather than treating examples as isolated ICL shots. This would be a practical advance for IR evaluation pipelines that currently struggle with annotation scale.
Major comments (2)
- [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.
- [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.
Minor comments (2)
- [Method] Clarify the exact prompting templates used for the Instructor and Assessor LLMs, including how the generated narratives are formatted and inserted.
- [Hybrid Pooling Strategy] The hybrid pooling description should specify the exact depth-k value, which participating systems are pooled, and how ties or disagreements between human and LLM labels are resolved.
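The hybrid pooling split the comment refers to can be made concrete with a toy sketch: documents appearing in the top-k of any participating run go to human assessors, and the remaining candidates go to the LLM. The depth k=2, the run names, and the rankings here are illustrative assumptions, not values from the paper.

```python
def hybrid_pool(runs: dict[str, list[str]], candidates: set[str], k: int):
    """Split candidates into a human pool (top-k of any run) and an LLM pool."""
    human_pool = {doc for ranking in runs.values() for doc in ranking[:k]}
    llm_pool = candidates - human_pool
    return human_pool & candidates, llm_pool

# Two made-up system runs over six candidate documents.
runs = {
    "bm25":  ["d1", "d2", "d3", "d4"],
    "dense": ["d2", "d5", "d1", "d6"],
}
candidates = {"d1", "d2", "d3", "d4", "d5", "d6"}
human, llm = hybrid_pool(runs, candidates, k=2)
```

As the referee notes, a full specification would also need the actual depth k, the pooled systems, and a rule for resolving human-LLM disagreements on documents judged by both.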
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
Referee: [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.
Authors: We agree this is a valid concern and that the original submission lacks explicit validation of the generated narratives. The end-to-end gains are consistent across experiments, but without intermediate checks it is difficult to fully attribute improvements to relevance context learning. In the revision we will add a human evaluation of narrative fidelity (e.g., expert ratings of how well narratives capture topic-specific criteria) together with an ablation comparing RCL against variants that use unprocessed examples or generic prompts. revision: yes
Referee: [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.
Authors: The full paper reports results on standard TREC collections (e.g., TREC DL 2019/2020 and Robust04) using 50–100 topics per collection, with metrics including accuracy, Cohen’s kappa, and Kendall’s tau correlation to human judgments, plus paired statistical significance tests. We will revise the abstract to include the dataset names, number of topics/queries, and the key quantitative improvements (e.g., absolute gains in kappa and accuracy) so that the central claims are self-contained. revision: yes
Circularity Check
No circularity: RCL is an empirical prompting method evaluated against external human judgments
Full rationale
The paper defines RCL as a two-stage process in which an Instructor LLM first produces topic-specific relevance narratives from a small set of human-judged query-document pairs; those narratives then serve as structured context for an Assessor LLM on unseen pairs. Performance is measured by end-to-end agreement with held-out human labels in a hybrid pooling setup. No equation, parameter fit, or self-citation reduces the reported gains to the input judgments by construction. Human judgments remain an independent external signal; the method merely re-uses them via prompting rather than deriving them from itself. The central claim therefore rests on comparative experiments rather than tautological re-labeling of the same data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs can analyze sets of judged query-document pairs and generate explicit, usable narratives that capture topic-specific relevance criteria.