Recognition: no theorem link
Hybrid Pooling with LLMs via Relevance Context Learning
Pith reviewed 2026-05-16 06:03 UTC · model grok-4.3
The pith
Relevance Context Learning turns a few human judgements into explicit topic narratives that guide LLMs to produce more accurate relevance labels than standard prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Relevance Context Learning works by prompting an Instructor LLM to analyse sets of judged query-document pairs and generate explicit narratives describing what constitutes relevance for a given topic; these narratives then serve as structured prompts for an Assessor LLM to label unseen pairs. In a hybrid pooling strategy, human assessors judge a shallow depth-k pool while the remaining documents are labelled by the RCL-guided LLM, yielding relevance judgements that substantially outperform zero-shot prompting and improve over standard in-context learning.
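The two-stage process described above can be sketched in code. This is a minimal illustration, not the paper's actual templates: `call_llm` is a stub standing in for any chat-completion API, and the prompt wording, function names, and example data are all assumptions made for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Stub LLM: replace with a real chat-completion call."""
    if "Summarise what makes a document relevant" in prompt:
        return "A document is relevant if it discusses treatment options."
    return "relevant"

def build_narrative(topic: str, judged_pairs: list[tuple[str, int]]) -> str:
    """Stage 1: the Instructor LLM distils judged examples into a relevance narrative."""
    examples = "\n".join(
        f"Document: {doc}\nLabel: {'relevant' if label else 'non-relevant'}"
        for doc, label in judged_pairs
    )
    prompt = (
        f"Topic: {topic}\n{examples}\n"
        "Summarise what makes a document relevant to this topic."
    )
    return call_llm(prompt)

def assess(topic: str, narrative: str, document: str) -> int:
    """Stage 2: the Assessor LLM labels an unseen document, guided by the narrative."""
    prompt = (
        f"Topic: {topic}\nRelevance criteria: {narrative}\n"
        f"Document: {document}\nAnswer 'relevant' or 'non-relevant'."
    )
    return int(call_llm(prompt).strip().startswith("relevant"))

# Hypothetical topic and judged pairs, purely for illustration.
narrative = build_narrative(
    "diabetes treatment",
    [("Insulin therapy guidelines.", 1), ("History of the syringe.", 0)],
)
label = assess("diabetes treatment", narrative, "New oral drugs for type 2 diabetes.")
```

The key design point the sketch captures is that the human judgements are consumed once, in Stage 1, and reach the Assessor only as a distilled narrative rather than as raw in-context shots.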
What carries the argument
Relevance Context Learning (RCL), a two-stage process in which an Instructor LLM distills human judgements into topic-specific relevance narratives that then guide an Assessor LLM.
Load-bearing premise
LLMs can reliably extract and articulate topic-specific relevance criteria from a small number of judged examples in a form that generalizes accurately to unseen query-document pairs without introducing systematic bias.
What would settle it
On a held-out collection of human-judged query-document pairs, if relevance labels produced by RCL show lower agreement with the human gold labels than labels produced by standard in-context learning, the performance advantage would be falsified.
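The falsification test above reduces to comparing chance-corrected agreement with the gold labels. A minimal sketch, with made-up label vectors standing in for real RCL and ICL outputs:

```python
from collections import Counter

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[c] * pb[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

gold = [1, 0, 1, 1, 0, 0, 1, 0]   # held-out human judgements
rcl  = [1, 0, 1, 1, 0, 1, 1, 0]   # hypothetical RCL labels
icl  = [1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical ICL labels

# The central claim survives only if RCL's agreement with gold
# is at least as high as standard ICL's.
claim_holds = cohen_kappa(gold, rcl) >= cohen_kappa(gold, icl)
```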

Original abstract
High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or in-context learning (ICL) with a small number of labelled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalise to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labelled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyse sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labelled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Relevance Context Learning (RCL), a two-stage LLM framework for relevance judgment in IR: an Instructor LLM first analyzes small sets of human-judged query-document pairs to generate explicit topic-specific relevance narratives, which are then injected as structured context into prompts for an Assessor LLM to label unseen pairs. The authors also propose a hybrid pooling strategy that combines shallow human judgments on system pools with LLM labels for the remainder. They claim experimental results show RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL.
Significance. If the central claim holds after proper validation, RCL could meaningfully improve the cost-effectiveness and reliability of constructing large-scale relevance judgment datasets by converting limited human annotations into reusable, topic-aware narratives rather than treating examples as isolated ICL shots. This would be a practical advance for IR evaluation pipelines that currently struggle with annotation scale.
Major comments (2)
- [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.
- [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.
Minor comments (2)
- [Method] Clarify the exact prompting templates used for the Instructor and Assessor LLMs, including how the generated narratives are formatted and inserted.
- [Hybrid Pooling Strategy] The hybrid pooling description should specify the exact depth-k value, which participating systems are pooled, and how ties or disagreements between human and LLM labels are resolved.
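The hybrid pooling split the comment refers to can be made concrete with a toy sketch: documents appearing in the top-k of any participating run go to human assessors, and the remaining candidates go to the LLM. The depth k=2, the run names, and the rankings here are illustrative assumptions, not values from the paper.

```python
def hybrid_pool(runs: dict[str, list[str]], candidates: set[str], k: int):
    """Split candidates into a human pool (top-k of any run) and an LLM pool."""
    human_pool = {doc for ranking in runs.values() for doc in ranking[:k]}
    llm_pool = candidates - human_pool
    return human_pool & candidates, llm_pool

# Two made-up system runs over six candidate documents.
runs = {
    "bm25":  ["d1", "d2", "d3", "d4"],
    "dense": ["d2", "d5", "d1", "d6"],
}
candidates = {"d1", "d2", "d3", "d4", "d5", "d6"}
human, llm = hybrid_pool(runs, candidates, k=2)
```

As the referee notes, a full specification would also need the actual depth k, the pooled systems, and a rule for resolving human-LLM disagreements on documents judged by both.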
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
Referee: [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.
Authors: We agree this is a valid concern and that the original submission lacks explicit validation of the generated narratives. The end-to-end gains are consistent across experiments, but without intermediate checks it is difficult to fully attribute improvements to relevance context learning. In the revision we will add a human evaluation of narrative fidelity (e.g., expert ratings of how well narratives capture topic-specific criteria) together with an ablation comparing RCL against variants that use unprocessed examples or generic prompts. revision: yes
Referee: [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.
Authors: The full paper reports results on standard TREC collections (e.g., TREC DL 2019/2020 and Robust04) using 50–100 topics per collection, with metrics including accuracy, Cohen’s kappa, and Kendall’s tau correlation to human judgments, plus paired statistical significance tests. We will revise the abstract to include the dataset names, number of topics/queries, and the key quantitative improvements (e.g., absolute gains in kappa and accuracy) so that the central claims are self-contained. revision: yes
Circularity Check
No circularity: RCL is an empirical prompting method evaluated against external human judgments
Full rationale
The paper defines RCL as a two-stage process in which an Instructor LLM first produces topic-specific relevance narratives from a small set of human-judged query-document pairs; those narratives then serve as structured context for an Assessor LLM on unseen pairs. Performance is measured by end-to-end agreement with held-out human labels in a hybrid pooling setup. No equation, parameter fit, or self-citation reduces the reported gains to the input judgments by construction. Human judgments remain an independent external signal; the method merely re-uses them via prompting rather than deriving them from itself. The central claim therefore rests on comparative experiments rather than tautological re-labeling of the same data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs can analyze sets of judged query-document pairs and generate explicit, usable narratives that capture topic-specific relevance criteria.