pith. machine review for the scientific record. sign in

arxiv: 2602.08457 · v2 · submitted 2026-02-09 · 💻 cs.IR

Recognition: no theorem link

Hybrid Pooling with LLMs via Relevance Context Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-16 06:03 UTC · model grok-4.3

classification 💻 cs.IR
keywords relevance assessmentlarge language modelsinformation retrievalin-context learninghybrid poolingrelevance narrativesdataset construction
0
0 comments X

The pith

Relevance Context Learning turns a few human judgements into explicit topic narratives that guide LLMs to produce more accurate relevance labels than standard prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Relevance Context Learning to improve how large language models assess document relevance for information retrieval evaluation. Rather than feeding examples directly into prompts or relying on zero-shot instructions, the method first asks one LLM to examine judged query-document pairs and write out the underlying topic-specific relevance criteria in narrative form. A second LLM then uses those narratives as structured guidance when judging new pairs. The approach is tested in a hybrid pooling setup where humans label only a shallow pool of documents and LLMs handle the remainder, showing consistent gains over both zero-shot and conventional in-context learning.

Core claim

Relevance Context Learning works by prompting an Instructor LLM to analyse sets of judged query-document pairs and generate explicit narratives describing what constitutes relevance for a given topic; these narratives then serve as structured prompts for an Assessor LLM to label unseen pairs. In a hybrid pooling strategy, human assessors judge a shallow depth-k pool while the remaining documents are labelled by the RCL-guided LLM, yielding relevance judgements that substantially outperform zero-shot prompting and improve over standard in-context learning.

What carries the argument

Relevance Context Learning (RCL), a two-stage process in which an Instructor LLM distills human judgements into topic-specific relevance narratives that then guide an Assessor LLM.

Load-bearing premise

LLMs can reliably extract and articulate topic-specific relevance criteria from a small number of judged examples in a form that generalizes accurately to unseen query-document pairs without introducing systematic bias.

What would settle it

On a held-out collection of human-judged query-document pairs, if relevance labels produced by RCL show lower agreement with the human gold labels than labels produced by standard in-context learning, the performance advantage would be falsified.

Figures

Figures reproduced from arXiv: 2602.08457 by David Otero, Javier Parapar.

Figure 1
Figure 1. Figure 1: The hybrid pooling approach using Relevance Con [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Per-query differences in F1 scores across ten ex [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Schematic overview of narrative generation: the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
read the original abstract

High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or in-context learning (ICL) with a small number of labelled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalise to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labelled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyse sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labelled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Relevance Context Learning (RCL), a two-stage LLM framework for relevance judgment in IR: an Instructor LLM first analyzes small sets of human-judged query-document pairs to generate explicit topic-specific relevance narratives, which are then injected as structured context into prompts for an Assessor LLM to label unseen pairs. The authors also propose a hybrid pooling strategy that combines shallow human judgments on system pools with LLM labels for the remainder. They claim experimental results show RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL.

Significance. If the central claim holds after proper validation, RCL could meaningfully improve the cost-effectiveness and reliability of constructing large-scale relevance judgment datasets by converting limited human annotations into reusable, topic-aware narratives rather than treating examples as isolated ICL shots. This would be a practical advance for IR evaluation pipelines that currently struggle with annotation scale.

major comments (2)
  1. [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.
  2. [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.
minor comments (2)
  1. [Method] Clarify the exact prompting templates used for the Instructor and Assessor LLMs, including how the generated narratives are formatted and inserted.
  2. [Hybrid Pooling Strategy] The hybrid pooling description should specify the exact depth-k value, which participating systems are pooled, and how ties or disagreements between human and LLM labels are resolved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract and experimental evaluation section] The load-bearing assumption that the Instructor LLM's generated narratives faithfully capture topic-specific relevance criteria without introducing systematic bias or distortion is untested. No intermediate human validation, fidelity metrics, or ablation on narrative quality is reported, so end-to-end gains could stem from incidental prompt effects rather than genuine relevance context learning.

    Authors: We agree this is a valid concern and that the original submission lacks explicit validation of the generated narratives. The end-to-end gains are consistent across experiments, but without intermediate checks it is difficult to fully attribute improvements to relevance context learning. In the revision we will add a human evaluation of narrative fidelity (e.g., expert ratings of how well narratives capture topic-specific criteria) together with an ablation comparing RCL against variants that use unprocessed examples or generic prompts. revision: yes

  2. Referee: [Abstract] The abstract asserts substantial outperformance over zero-shot and ICL but supplies no dataset names, metrics (e.g., Cohen's kappa, accuracy, or nDCG correlation), statistical tests, or even the number of topics/queries used. Without these details the support for the central claim cannot be assessed.

    Authors: The full paper reports results on standard TREC collections (e.g., TREC DL 2019/2020 and Robust04) using 50–100 topics per collection, with metrics including accuracy, Cohen’s kappa, and Kendall’s tau correlation to human judgments, plus paired statistical significance tests. We will revise the abstract to include the dataset names, number of topics/queries, and the key quantitative improvements (e.g., absolute gains in kappa and accuracy) so that the central claims are self-contained. revision: yes

Circularity Check

0 steps flagged

No circularity: RCL is an empirical prompting method evaluated against external human judgments

full rationale

The paper defines RCL as a two-stage process in which an Instructor LLM first produces topic-specific relevance narratives from a small set of human-judged query-document pairs; those narratives then serve as structured context for an Assessor LLM on unseen pairs. Performance is measured by end-to-end agreement with held-out human labels in a hybrid pooling setup. No equation, parameter fit, or self-citation reduces the reported gains to the input judgments by construction. Human judgments remain an independent external signal; the method merely re-uses them via prompting rather than deriving them from itself. The central claim therefore rests on comparative experiments rather than tautological re-labeling of the same data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that current LLMs can distill human relevance criteria into generalizable narratives from limited examples.

axioms (1)
  • domain assumption LLMs can analyze sets of judged query-document pairs and generate explicit, usable narratives that capture topic-specific relevance criteria.
    This is the core step of the Instructor LLM in RCL as stated in the abstract.

pith-pipeline@v0.9.0 · 5568 in / 1086 out tokens · 34383 ms · 2026-05-16T06:03:00.101582+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi

  2. [2]

    https://arxiv.org/abs/2405.05600 arXiv: 2405.05600 [cs.IR] (cited on pp

    Can we use large language models to fill relevance judgment holes? (2024). https://arxiv.org/abs/2405.05600 arXiv: 2405.05600 [cs.IR] (cited on pp. 1, 9)

  3. [3]

    Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. Llms can be fooled into labelling a document as relevant: best café near me; this paper is perfectly relevant. InProceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region(SIGIR-AP 2024). Association fo...

  4. [4]

    Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2026. On the use of LLMs for relevance labelling.ACM Transactions on Information Systems. doi:10.1145/3788872 (cited on pp. 1, 9)

  5. [5]

    Omar Alonso and Stefano Mizzaro. 2012. Using crowdsourcing for TREC rel- evance assessment.Information Processing & Management, 48, 6, 1053–1066. doi:10.1016/j.ipm.2012.01.004 (cited on p. 1)

  6. [6]

    Aslam, Virgil Pavlu, and Emine Yilmaz

    Javed A. Aslam, Virgil Pavlu, and Emine Yilmaz. 2006. A statistical method for system evaluation using incomplete judgments. InProceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in In- formation Retrieval(SIGIR ’06). Association for Computing Machinery, Seattle, Washington, USA, 541–548.isbn: 1595933697. doi:10.1...

  7. [7]

    Krisztian Balog, Don Metzler, and Zhen Qin. 2025. Rankers, judges, and as- sistants: towards understanding the interplay of llms in information retrieval evaluation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’25). Association for Computing Machinery, Padua, Italy, 3865–3875.isbn...

  8. [8]

    Voorhees

    Chris Buckley, Darrin Dimmick, Ian Soboroff, and Ellen M. Voorhees. 2007. Bias and the limits of pooling for large collections.Information Retrieval Journal, 10, 6, 491–508. doi:10.1007/s10791-007-9032-x (cited on pp. 1, 9)

  9. [9]

    Voorhees

    Chris Buckley and Ellen M. Voorhees. 2005. Retrieval system evaluation. In TREC: Experiment and Evaluation in Information Retrieval. Ellen M. Voorhees and Donna K. Harman, (Eds.) The MIT Press, 53–78 (cited on p. 1)

  10. [10]

    Charles LA Clarke and Laura Dietz. 2024. Llm-based relevance assessment still can’t replace human relevance assessment. (2024). https://arxiv.org/abs/2412.1 7156 arXiv: 2412.17156[cs.IR](cited on p. 4)

  11. [11]

    Cleverdon

    Cyril W. Cleverdon. 1967. The cranfield tests on index language devices.Aslib Proceedings, 19, 6, 173–194. doi:10.1108/eb050097 (cited on pp. 1, 9)

  12. [12]

    Cormack and Maura R

    Gordon V. Cormack and Maura R. Grossman. 2018. Beyond pooling. InPro- ceedings of ACM SIGIR 2018. ACM, New York, NY, USA, 1169–1172. doi:10.1145 /3209978.3210119 (cited on p. 1)

  13. [13]

    Cormack, Christopher R

    Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. 1998. Efficient construction of large test collections. InProceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Informa- tion Retrieval(SIGIR ’98). Association for Computing Machinery, Melbourne, Australia, 282–289.isbn: 1581130155. doi:10.1145/2...

  14. [14]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 deep learning track. InProceedings of the Twenty- Ninth Text REtrieval Conference. NIST Special Publication 1266, Gaithersburg, Maryland, USA. https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.DL.pdf (cited on p. 3)

  15. [15]

    Voorhees

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 deep learning track. InProceedings of the Twenty-Eight Text REtrieval Conference. NIST Special Publication 1250, Gaithersburg, Maryland, USA. https : / / trec . nist . gov / pubs / trec28 / papers /OVERVIEW.DL.pdf (cited on p. 3)

  16. [16]

    Qingxiu Dong et al. 2024. A survey on in-context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, 1107–1128 (cited on p. 2)

  17. [17]

    Guglielmo Faggioli et al. 2023. Perspectives on large language models for rele- vance judgment. InProceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval(ICTIR ’23). Association for Computing Machinery, Taipei, Taiwan, 39–50. doi:10.1145/3578337.3605136 (cited on pp. 1, 2, 9)

  18. [18]

    Or Honovich, Uri Shaham, Samuel Bowman, and Omer Levy. 2023. Instruction induction: from few examples to natural language task descriptions. InProceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1935–1952 (cited on pp. 5, 9)

  19. [19]

    Salma Kharrat, Fares Fourati, and Marco Canini. 2025. Acing: actor-critic for instruction learning in black-box llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computa- tional Linguistics, 19075–19102.isbn: 979-8-89176-332-6. doi:10.18653/v1/2025 .emnlp-main.965 (cited on p. 9)

  20. [20]

    Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, and Christof Monz. 2025. Lost at the beginning of reasoning. (2025). https://arxiv.org/abs/2506.22058 arXiv: 2506.22058 [cs.CL] (cited on p. 4)

  21. [21]

    Losada, Guido Zuccon, and Mihai Lupu

    Aldo Lipani, David E. Losada, Guido Zuccon, and Mihai Lupu. 2019. Fixed- cost pooling strategies.IEEE Transactions on Knowledge & Data Engineering. doi:10.1109/TKDE.2019.2947049 (cited on pp. 1, 9)

  22. [22]

    Aldo Lipani, Guido Zuccon, Mihai Lupu, Bevan Koopman, and Allan Hanbury

  23. [23]

    In Proceedings of the 2016 ACM International Conference on the Theory of Infor- mation Retrieval(ICTIR ’16)

    The impact of fixed-cost pooling strategies on test collection bias. In Proceedings of the 2016 ACM International Conference on the Theory of Infor- mation Retrieval(ICTIR ’16). Association for Computing Machinery, Newark, Delaware, USA, 105–108.isbn: 9781450344975. doi:10.1145/2970398.2970429 (cited on pp. 1, 9). Hybrid Pooling with LLMs via Relevance Co...

  24. [24]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: how language models use long contexts.Transactions of the Association for Computational Linguistics, 12, 157–173 (cited on pp. 1, 4)

  25. [25]

    Losada, Javier Parapar, and Álvaro Barreiro

    David E. Losada, Javier Parapar, and Álvaro Barreiro. 2016. Feeling lucky?: multi-armed bandits for ordering judgements in pooling-based evaluation. In Proceedings of the 31st Annual ACM Symposium on Applied Computing(SAC ’16). Sascha Ossowski, (Ed.) Association for Computing Machinery, Pisa, Italy, 1027–1034.isbn: 9781450337397. doi:10.1145/2851613.28516...

  26. [26]

    Losada, Javier Parapar, and Álvaro Barreiro

    David E. Losada, Javier Parapar, and Álvaro Barreiro. 2017. Multi-armed ban- dits for adjudicating documents in pooling-based evaluation of information retrieval systems.Information Processing & Management, 53, 5, 1005–1025. doi:10.1016/j.ipm.2017.04.005 (cited on p. 1)

  27. [27]

    Sean MacAvaney and Luca Soldaini. 2023. One-shot labeling for automatic rele- vance estimation. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’23). Association for Computing Machinery, Taipei, Taiwan, 2230–2235. doi:10.1145/3539618.35 92032 (cited on p. 1)

  28. [28]

    Jack McKechnie, Graham McDonald, and Craig Macdonald. 2025. Context example selection for llm generated relevance assessments. InAdvances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part I. Springer-Verlag, Lucca, Italy, 293–309.isbn: 978-3-031-88707-9. doi:10.1007/97...

  29. [29]

    Jack McKechnie, Graham McDonald, and Craig Macdonald. 2025. Measuring hypothesis testing errors in the evaluation of retrieval systems. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’25). Association for Computing Machinery, Padua, Italy, 2875–2879.isbn: 9798400715921. doi:10.1145/3...

  30. [30]

    Simone Merlo, Stefano Marchesin, Guglielmo Faggioli, and Nicola Ferro. 2025. A cost-effective framework to evaluate llm-generated relevance judgements. InProceedings of the 34th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, Seoul, Repub- lic of Korea, 2115–2126.isbn: 9798400720406. doi:10.1145/3...

  31. [31]

    David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Limitations of automatic relevance assessments with large language models for fair and reliable retrieval evaluation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’25). Association for Computing Machinery, Padua, Italy, 2545–2...

  32. [32]

    David Otero, Javier Parapar, and Álvaro Barreiro. 2023. Relevance feedback for building pooled test collections.Journal of Information Science. doi:10.1177/016 55515231171085 (cited on p. 1)

  33. [33]

    David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Towards reliable test- ing for multiple information retrieval system comparisons. InAdvances in Information Retrieval. Springer Nature Switzerland, Cham, 424–439.isbn: 978- 3-031-88711-6 (cited on p. 3)

  34. [34]

    Mark Sanderson. 2010. Test collection based evaluation of information retrieval systems.Foundations and Trends in Information Retrieval, 4, 4, 247–375. doi:10 .1561/1500000009 (cited on p. 1)

  35. [35]

    Mark Sanderson and Justin Zobel. 2005. Information retrieval system eval- uation: effort, sensitivity, and reliability. InProceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Informa- tion Retrieval(SIGIR ’05). Association for Computing Machinery, Salvador, Brazil, 162–169.isbn: 1595930345. doi:10.1145/1076034.1...

  36. [36]

    Ian Soboroff. 2024. Don’t use llms to make relevance judgments. (2024). https: //arxiv.org/abs/2409.15133 arXiv: 2409.15133[cs.IR](cited on pp. 1, 9)

  37. [37]

    Spärck Jones and Cornelis J

    K. Spärck Jones and Cornelis J. van Rijsbergen. 1975. Report on the need for and provision of an ’ideal’ information retrieval test collection.Computer Laboratory(cited on p. 9)

  38. [38]

    Voorhees, Tetsuya Sakai, and Ian Soboroff

    Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai, and Ian Soboroff. 2025. Llm- assisted relevance assessments: when should we ask llms for help? InProceed- ings of the 48th International ACM SIGIR Conference on Research and Develop- ment in Information Retrieval(SIGIR ’25). Association for Computing Machin- ery, Padua, Italy, 95–105.isbn: 9798400715921. do...

  39. [39]

    Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large language models can accurately predict searcher preferences. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’24). Association for Computing Machinery, Washington DC, USA, 1930–1940. doi:10.1145/3626772.3657707 ...

  40. [40]

    Shivani Upadhyay, Ehsan Kamalloo, and Jimmy Lin. 2024. Llms can patch up missing relevance judgments in evaluation. (2024). https://arxiv.org/abs/2405.0 4727 arXiv: 2405.04727[cs.IR](cited on p. 1)

  41. [41]

    Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024. A large-scale study of relevance assessments with large language models: an initial look. (2024). https://arxiv.org/abs/2411.08275 arXiv: 2411.08275 [cs.IR] (cited on p. 1)

  42. [42]

    Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024. Umbrela: umbrela is the (open-source reproduction of the) bing relevance assessor. (2024). https://arxiv.org/abs/2406.06519 arXiv: 2406.06519 [cs.IR](cited on pp. 1, 4, 9)

  43. [43]

    Voorhees

    Ellen M. Voorhees. 2002. The philosophy of information retrieval evaluation. InEvaluation of Cross-Language Information Retrieval Systems: Second Work- shop of the Cross-Language Evaluation Forum(CLEF ’01). Springer, Darmstadt, Germany, 355–370.isbn: 9783540456919. doi:10.1007/3-540-45691-0\_34 (cited on pp. 1, 9)

  44. [44]

    Voorhees and Donna K

    Ellen M. Voorhees and Donna K. Harman. 2000. Overview of the eighth text retrieval conference (TREC-8). InProceedings of the Eighth Text REtrieval Con- ference. NIST Special Publication 500-246, Gaithersburg, Maryland, USA, 1–24. doi:10.6028/NIST.SP.500-246 (cited on p. 3)

  45. [45]

    Voorhees and Donna K

    Ellen M. Voorhees and Donna K. Harman. 2005.TREC: Experiment and Eval- uation in Information Retrieval. The MIT Press.isbn: 0262220733 (cited on p. 9)

  46. [46]

    Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, and Graham Neubig. 2025. Prompt-mii: meta-learning instruction induction for llms. (2025). https://arxiv.org/abs/2510.16932 arXiv: 2510.16932[cs.CL](cited on p. 9)

  47. [47]

    Justin Zobel. 1998. How reliable are the results of large-scale information retrieval experiments? InProceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR ’98). Association for Computing Machinery, Melbourne, Australia, 307–314. isbn: 1581130155. doi:10.1145/290941.291014 (cited on p. 9)