pith. machine review for the scientific record. sign in

arxiv: 2603.08819 · v3 · submitted 2026-03-09 · 💻 cs.IR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords retrieval-augmented generationRAGinformation coverageretrieval metricsnugget coverageevaluationbenchmarks
0
0 comments X

The pith

Retrieval metrics based on information coverage reliably predict how complete the final answers are in retrieval-augmented generation systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether upstream retrieval quality in RAG systems can forecast the amount of key information that appears in the generated response. Experiments run across two text benchmarks and one multimodal benchmark, covering 15 text retrieval stacks, 10 multimodal stacks, and four different RAG pipelines. Coverage-oriented retrieval metrics correlate strongly with the fraction of target nuggets that end up covered in the outputs, both at the level of individual topics and across entire systems. The link holds best when the retrieval objective matches the generation goal, but more complex iterative pipelines weaken the dependence on initial retrieval. The pattern supplies evidence that retrieval metrics can stand in for full RAG performance checks.

Core claim

Coverage-based retrieval metrics serve as reliable early indicators of nugget coverage in RAG-generated responses. Strong correlations appear at both topic and system levels across the TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks. The relationship strengthens when retrieval objectives align with generation goals, while complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness.

What carries the argument

Coverage-based retrieval metrics that quantify how much of the target information is captured in the retrieved documents rather than relying on relevance ranking alone.

If this is right

  • Retrieval metrics can serve as practical proxies for RAG performance without requiring full response generation.
  • Alignment between retrieval objectives and generation goals increases the predictive strength of coverage metrics.
  • Iterative RAG pipelines can reduce the impact of weaker initial retrieval on final answer quality.
  • The observed correlations hold across both text and multimodal settings and multiple evaluation frameworks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAG developers could screen candidate retrievers using coverage metrics before integrating them into full systems.
  • Retrieval methods might be redesigned with explicit coverage targets matched to the intended generation task.
  • Similar coverage relationships could be tested in other retrieval-augmented settings such as long-form summarization.

Load-bearing premise

The chosen benchmarks and evaluation frameworks produce coverage measures that generalize beyond the tested RAG pipelines and domains.

What would settle it

A new benchmark or domain in which coverage-based retrieval metrics show no or weak correlation with the nugget coverage achieved in generated RAG responses.

read the original abstract

Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether upstream retrieval metrics can serve as reliable proxies for information coverage in RAG-generated responses. Using TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks, it evaluates 15 text and 10 multimodal retrieval stacks across four RAG pipelines with Auto-ARGUE and MiRAGE frameworks, reporting strong correlations between coverage-based retrieval metrics and nugget coverage at topic and system levels. The relationship is strongest when retrieval objectives align with generation goals, though iterative pipelines may partially decouple the two.

Significance. If the reported correlations are robust after appropriate statistical controls and stratification, the work supplies empirical grounding for treating retrieval effectiveness as an early indicator of RAG performance. This could streamline evaluation pipelines and clarify when retrieval quality directly translates to generation coverage, while highlighting the moderating role of pipeline complexity.

major comments (3)
  1. [§5] §5 (Results): The headline claim of 'strong correlations' at topic and system levels is presented in aggregate without reported correlation coefficients, p-values, confidence intervals, or effect sizes. The abstract supplies no statistical details, and the absence of these quantities prevents assessment of whether the correlations are practically meaningful or driven by outliers.
  2. [§5.3] §5.3 (Pipeline analysis): The manuscript notes that 'more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness' but does not quantify this decoupling via interaction terms, separate correlation tables by pipeline type, or moderation analysis. If the 4 pipelines include both simple and iterative variants without stratification, the aggregate coefficients may be inflated by aligned cases and fail to support the general proxy claim.
  3. [§4.2] §4.2 (Experimental setup): No explicit rules for data exclusion, handling of multimodal vs. text differences, or controls for topic difficulty are described. This leaves open whether the observed correlations generalize or are confounded by benchmark-specific properties of TREC NeuCLIR, TREC RAG, and WikiVideo.
minor comments (2)
  1. [Table 1] Table 1 and Figure 2: Axis labels and legend entries use inconsistent abbreviations for retrieval stacks; expand or standardize for readability.
  2. [§2] Related work section: The discussion of prior RAG evaluation frameworks (e.g., ARGUE, MiRAGE) would benefit from explicit comparison of their nugget coverage definitions to avoid potential circularity in metric choice.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements in statistical reporting, pipeline stratification, and experimental controls.

read point-by-point responses
  1. Referee: [§5] §5 (Results): The headline claim of 'strong correlations' at topic and system levels is presented in aggregate without reported correlation coefficients, p-values, confidence intervals, or effect sizes. The abstract supplies no statistical details, and the absence of these quantities prevents assessment of whether the correlations are practically meaningful or driven by outliers.

    Authors: We agree that the current presentation lacks the necessary statistical details. In the revised manuscript we will report Pearson and Spearman correlation coefficients, associated p-values, 95% confidence intervals, and effect sizes (including r-squared) for both topic-level and system-level analyses. These statistics will be provided in aggregate as well as stratified by benchmark and pipeline type to allow assessment of practical significance and potential outlier effects. revision: yes

  2. Referee: [§5.3] §5.3 (Pipeline analysis): The manuscript notes that 'more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness' but does not quantify this decoupling via interaction terms, separate correlation tables by pipeline type, or moderation analysis. If the 4 pipelines include both simple and iterative variants without stratification, the aggregate coefficients may be inflated by aligned cases and fail to support the general proxy claim.

    Authors: We acknowledge the value of quantifying the decoupling effect. The revision will add separate correlation tables for simple versus iterative pipelines, include interaction terms in regression models to test moderation by pipeline complexity, and present a dedicated moderation analysis. This will clarify the conditions under which retrieval metrics remain reliable proxies and prevent over-generalization from aggregate results. revision: yes

  3. Referee: [§4.2] §4.2 (Experimental setup): No explicit rules for data exclusion, handling of multimodal vs. text differences, or controls for topic difficulty are described. This leaves open whether the observed correlations generalize or are confounded by benchmark-specific properties of TREC NeuCLIR, TREC RAG, and WikiVideo.

    Authors: We will expand §4.2 to specify data exclusion rules (e.g., minimum nugget count per topic), detail the separate processing and normalization steps for multimodal versus text benchmarks, and incorporate controls for topic difficulty through stratification by topic features and inclusion of topic as a random effect in statistical models. These additions will strengthen claims of generalizability across the three benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical correlation analysis on external benchmarks

full rationale

The paper reports measured correlations between retrieval coverage metrics and generated response nugget coverage across TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks using 15 text + 10 multimodal stacks and four RAG pipelines. No equations, fitted parameters, or derivations are defined in terms of the target quantities; all results are direct statistical observations from independent evaluation frameworks (Auto-ARGUE, MiRAGE). The analysis contains no self-definitional steps, no predictions that reduce to fitted inputs, and no load-bearing self-citations that substitute for external verification. The central claim is therefore an empirical finding rather than a constructed equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard IR assumptions about coverage metrics and nugget-based evaluation; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Nugget coverage measured by Auto-ARGUE and MiRAGE is a valid proxy for information coverage in generated responses
    Central to linking retrieval metrics with generation quality

pith-pipeline@v0.9.0 · 5489 in / 1168 out tokens · 48501 ms · 2026-05-15T13:06:52.804484+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 11 internal anchors

  1. [1]

    Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, and Mo- hammad Aliannejadi. 2025. Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets. arXiv:2503.09902 [cs.IR] https://arxiv.org/abs/2503.09902

  2. [2]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511 [cs.CL] https://arxiv.org/abs/2310.11511

  3. [3]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. InThe Twelfth International Conference on Learning Representations

  4. [4]

    deterministic

    Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, et al. 2024. Non-determinism of" deterministic" llm settings.arXiv preprint arXiv:2408.04667(2024)

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

  6. [6]

    McKeown, and Michael Elhadad

    Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information Fusion in the Context of Multi-Document Summarization. InProceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, 550–557. https: //doi.org/10.3115/1034678.1034760

  7. [7]

    Andrew Blair-Stanek and Benjamin Van Durme. 2025. Llms provide unstable answers to legal questions. InProceedings of the Twentieth International Conference on Artificial Intelligence and Law. 425–429

  8. [8]

    Carbonell and Jade Goldstein

    Jaime G. Carbonell and Jade Goldstein. 2018. The Use of MMR and Diversity- Based Reranking in Document Reranking and Summarization. (6 2018). https: //doi.org/10.1184/R1/6610814.v1

  9. [9]

    Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. arXiv:2404.00610 [cs.CL] https://arxiv.org/abs/2404.00610

  10. [10]

    Olivier Chapelle, Shihao Ji, Ciya Liao, Emre Velipasaoglu, Larry Lai, and Su- Lin Wu. 2011. Intent-based diversification of web search results: metrics and algorithms.Inf. Retr.14, 6 (Dec. 2011), 572–592

  11. [11]

    Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xRAG: Extreme Context Compression for Retrieval-Augmented Generation with One Token. InAdvances in Neural Information Processing Systems (NeurIPS)

  12. [12]

    Clarke, Nick Craswell, Ian Soboroff, and Azin Ashkan

    Charles L.A. Clarke, Nick Craswell, Ian Soboroff, and Azin Ashkan. 2011. A comparative analysis of cascade measures for novelty and diversity. InProceedings of the Fourth ACM International Conference on Web Search and Data Mining(Hong Kong, China)(WSDM ’11). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/1935826.1935847

  13. [13]

    Clarke, Maheedhar Kolla, Gordon V

    Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. InProceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Singapore, Singapore)(SIGIR ’08). Association f...

  14. [14]

    Cormack, Charles L A Clarke, and Stefan Buettcher

    Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval(Boston, MA, USA)(SIGIR ’09). Association for Computing Machinery, New York, NY, USA, 758–759...

  15. [15]

    Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden, and James Mayfield. 2026. Incorporating Q&A Nuggets into Retrieval-Augmented Generation. InProceedings of the 48th European Conference on Information Retrieval (ECIR 2026)

  16. [16]

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. 2025. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. InFindings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carol...

  17. [17]

    Kevin Duh, Dawn Lawrie, Debashish Chakraborty, Roxana Petcu, Eugene Yang, Kenton Murraya, Daniel Khashabi, and Maxime Dassen. 2025. HLTCOE Genera- tion Team at TREC 2025. InThe Thirty-Fourth Text REtrieval Conference Proceed- ings (TREC2025). https://trec-ragtime.github.io/assets/notebooks/2025/hltcoe- gen.pdf

  18. [18]

    Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, and Dawn Lawrie. 2025. HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval.arXiv preprint arXiv:2506.22356(2025)

  19. [19]

    Assaf Elovic. 2023. gpt-researcher. https://github.com/assafelovic/gpt-researcher

  20. [20]

    Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. 2025. KiRAG: Knowledge- Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation.arXiv preprint arXiv:2502.18397(2025)

  21. [21]

    Naghmeh Farzi and Laura Dietz. 2024. An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments. arXiv:2402.00309 [cs.IR] https://arxiv. org/abs/2402.00309

  22. [22]

    Naghmeh Farzi and Laura Dietz. 2024. Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems. InProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval(Washington DC, USA)(ICTIR ’24). Association for Computing Machinery, New York, NY, USA, 175–184. https://doi.org/10.1145/3664190.3672511

  23. [23]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997

  24. [24]

    Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, and Taro Watanabe. 2025. Iterkey: Iterative keyword generation with llms for enhanced retrieval aug- mented generation. InProceedings of the Second Conference on Language Modeling (COLM’25)

  25. [25]

    de Vries

    Gijs Hendriksen, Djoerd Hiemstra, and Arjen P. de Vries. 2025. Selective Search as a First-Stage Retriever. InExperimental IR Meets Multilinguality, Multimodality, and Interaction: 16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9–12, 2025, Proceedings(Madrid, Spain). Springer-Verlag, Berlin, Heidelberg, 17–33. h...

  26. [26]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.ACM Trans. Inf. Syst.20, 4 (Oct. 2002), 422–446. https://doi. org/10.1145/582415.582418

  27. [27]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL] https://arxiv.org/ abs/2509.04664

  28. [28]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 [cs.CL] https://arxiv.org/abs/2310.03714

  29. [29]

    Gun Il Kim, Jong Wook Kim, and Beakcheol Jang. 2025. UniRAG: A Unified RAG Framework for Knowledge-Intensive Queries with Decomposition, Break- Down Reasoning, and Iterative Rewriting. InFindings of the Association for Computational Linguistics: EMNLP 2025. 18795–18810

  30. [30]

    Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Mar- tin, Ronald Colaianni, Nolan King, Eugene Yang, and Benjamin Van Durme. 2025. MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval. arXiv:2410.11619 [cs.CV] https://arxiv.org/abs/2410.11619

  31. [31]

    Weronika Lajewska and Krisztian Balog. 2025. GINGER: Grounded Information Nugget-Based Generation of Responses. InProceedings of the 48th International ACM SIGIR Conference (SIGIR ’25). https://krisztianbalog.com/files/sigir2025- ginger.pdf SIGIR 2025 paper

  32. [32]

    Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE.arXiv preprint arXiv:2403.06789(2024)

  33. [33]

    Victor Lavrenko and W Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51. ACM New York, NY, USA, 260–267

  34. [34]

    Oard, Luca Soldaini, and Eugene Yang

    Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. 2025. Overview of the TREC 2024 NeuCLIR Track. arXiv:2509.14355 [cs.IR] https://arxiv.org/abs/2509.14355

  35. [35]

    Liu, Kevin Lin, John Hewitt, Bhargavi Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173. https://doi.org/10.1162/tacl_a_00638

  36. [36]

    Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, and Jimmy Lin. 2025. Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality. arXiv:2505.02466 [cs.IR] https://arxiv.org/abs/ 2505.02466 10

  37. [37]

    Sean MacAvaney, Craig Macdonald, and Iadh Ounis. 2021. Streamlining Evalua- tion with ir-measures. arXiv:2111.13466 [cs.IR] https://arxiv.org/abs/2111.13466

  38. [38]

    Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Rec- knor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme. 2025. WikiVideo: Article Generation from Multiple Videos. arXiv:2504.00939 [cs.CV] https: //arxiv.org/abs/2504.00939

  39. [39]

    Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, and Benjamin Van Durme. 2025. Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Genera- tion. arXiv:2510.24870 [cs.CL] https://arxiv.org/abs/2510.24870

  40. [40]

    Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler

    James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler. 2024. On the Evaluation of Machine- Generated Reports. InProceedings of the 47th International ACM SIGIR Confer- ence on Research and Development in Information Retrieva...

  41. [41]

    Teague McMillan, Gabriele Dominici, Martin Gjoreski, and Marc Langheinrich

  42. [42]

    Towards Transparent Reasoning: What Drives Faithfulness in Large Lan- guage Models? arXiv:2510.24236 [cs.CL] https://arxiv.org/abs/2510.24236

  43. [43]

    Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Bench- mark for Complex Answer Retrieval. arXiv:1705.04803 [cs.IR] https://arxiv.org/ abs/1705.04803

  44. [44]

    Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Bench- mark for Complex Answer Retrieval. InProceedings of the ACM SIGIR International Conference on Theory of Information Retrieval(Amsterdam, The Netherlands) (ICTIR ’17). Association for Computing Machinery, New York, NY, USA, 293–296. https://doi.org/10.1145/3121050.3121099

  45. [45]

    Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, and Andrew Yates. 2025. Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector. arXiv:2510.00671 [cs.IR] https://arxiv.org/abs/2510.00671

  46. [46]

    Paul Over. 2001. The TREC interactive track: an annotated bibliography.Inf. Process. Manage.37, 3 (May 2001), 369–381. https://doi.org/10.1016/S0306- 4573(00)00053-4

  47. [47]

    Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 [cs.IR] https://arxiv.org/abs/2406.16828

  48. [48]

    Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework. arXiv:2411.09607 [cs.IR] https://arxiv.org/abs/2411.09607

  49. [49]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV] https://arxiv.org/ abs/2103.00020

  50. [50]

    de Melo, Benjamin Van Durme, and Rama Chellappa

    Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Ken- ton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, and Rama Chellappa. 2025. Video-ColBERT: Contextualized Late Interaction for Text-to- Video Retrieval. arXiv:2503.19009 [cs.CV] https://arxiv.org/abs/2503.19009

  51. [51]

    Stephen Robertson. 2008. A new interpretation of average precision. InPro- ceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 689–690

  52. [52]

    Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and trends®in information retrieval 3, 4 (2009), 333–389

  53. [53]

    Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Oluwaseun Eisape, Tanner Spendlove, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Kenton Murray, and Reno Kriz. 2025. MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion. arXiv:2503.20698 [cs.CV] https://...

  54. [54]

    Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, and Reno Kriz. 2026. RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval. arXiv:2602.02444 [cs.IR] https://arxiv.org/abs/2602.02444

  55. [55]

    Ian Soboroff, Donna Harman, et al. 2003. Overview of the TREC 2003 Novelty Track.. InTREC. 38–53

  56. [56]

    Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2025. Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges. arXiv:2504.15205 [cs.CL] https://arxiv.org/abs/2504.15205

  57. [57]

    Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2025. Assessing Support for the TREC 2024 RAG Track: A Large-Scale Comparative Study of LLM and Human Evaluations. InProceedings of the 48th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Pad...

  58. [58]

    William Walden, Orion Weller, Laura Dietz, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, and Eugene Yang. 2025. Auto-ARGUE: LLM-Based Report Generation Evaluation.arXiv preprint arXiv:2509.26184(2025)

  59. [59]

    Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Ben- jamin Van Durme. 2025. Rank1: Test-Time Compute for Reranking in Information Retrieval. arXiv:2502.18418 [cs.IR] https://arxiv.org/abs/2502.18418

  60. [60]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

  61. [61]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  62. [62]

    Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, and Yi Zhang. 2024. Im-rag: Multi-round retrieval-augmented generation through learning inner monologues. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 730–740

  63. [63]

    Diji Yang, Linda Zeng, Jinmeng Rao, and Yi Zhang. 2025. Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self- Practicing. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1305–1315

  64. [64]

    Oard, and Scott Miller

    Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, and Scott Miller

  65. [65]

    InProceedings of the 46th European Conference on Information Retrieval (ECIR)

    Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation. InProceedings of the 46th European Conference on Information Retrieval (ECIR). https://arxiv.org/abs/2401.04810

  66. [66]

    Eugene Yang, Dawn Lawrie, Orion Weller, and James Mayfield. 2025. HLTCOE at TREC 2024 NeuCLIR Track. arXiv:2510.00143 [cs.CL] https://arxiv.org/abs/ 2510.00143

  67. [67]

    Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, and Trevor Adri- aanse. 2026. RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation. arXiv:2601.10644 [cs.IR] https://arxiv.org/abs/2601.10644

  68. [68]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

  69. [69]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL] https://arxiv.org/abs/2506.05176

  70. [70]

    Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, and Weidong Guo. 2021. QBSUM: A large-scale query-based document summarization dataset from real-world applications.Computer Speech & Language66 (March 2021), 101166. https://doi.org/10.1016/j.csl.2020.101166

  71. [71]

    Wei Zheng, Xuanhui Wang, Hui Fang, and Hong Cheng. 2012. Coverage-based search result diversification.Inf. Retr.15, 5 (Oct. 2012), 433–457. https://doi.org/ 10.1007/s10791-011-9178-4

  72. [72]

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv:2310.01852 [cs.CV] https://arxiv.org/abs/2310.01852 11