arxiv: 2603.08819 · v3 · submitted 2026-03-09 · 💻 cs.IR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

Saron Samuel , Alexander Martin , Eugene Yang , Andrew Yates , Dawn Lawrie , Laura Dietz , Benjamin Van Durme

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords retrieval-augmented generationRAGinformation coverageretrieval metricsnugget coverageevaluationbenchmarks

0 comments

The pith

Retrieval metrics based on information coverage reliably predict how complete the final answers are in retrieval-augmented generation systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether upstream retrieval quality in RAG systems can forecast the amount of key information that appears in the generated response. Experiments run across two text benchmarks and one multimodal benchmark, covering 15 text retrieval stacks, 10 multimodal stacks, and four different RAG pipelines. Coverage-oriented retrieval metrics correlate strongly with the fraction of target nuggets that end up covered in the outputs, both at the level of individual topics and across entire systems. The link holds best when the retrieval objective matches the generation goal, but more complex iterative pipelines weaken the dependence on initial retrieval. The pattern supplies evidence that retrieval metrics can stand in for full RAG performance checks.

Core claim

Coverage-based retrieval metrics serve as reliable early indicators of nugget coverage in RAG-generated responses. Strong correlations appear at both topic and system levels across the TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks. The relationship strengthens when retrieval objectives align with generation goals, while complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness.

What carries the argument

Coverage-based retrieval metrics that quantify how much of the target information is captured in the retrieved documents rather than relying on relevance ranking alone.

If this is right

Retrieval metrics can serve as practical proxies for RAG performance without requiring full response generation.
Alignment between retrieval objectives and generation goals increases the predictive strength of coverage metrics.
Iterative RAG pipelines can reduce the impact of weaker initial retrieval on final answer quality.
The observed correlations hold across both text and multimodal settings and multiple evaluation frameworks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG developers could screen candidate retrievers using coverage metrics before integrating them into full systems.
Retrieval methods might be redesigned with explicit coverage targets matched to the intended generation task.
Similar coverage relationships could be tested in other retrieval-augmented settings such as long-form summarization.

Load-bearing premise

The chosen benchmarks and evaluation frameworks produce coverage measures that generalize beyond the tested RAG pipelines and domains.

What would settle it

A new benchmark or domain in which coverage-based retrieval metrics show no or weak correlation with the nugget coverage achieved in generated RAG responses.

read the original abstract

Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether upstream retrieval metrics can serve as reliable proxies for information coverage in RAG-generated responses. Using TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks, it evaluates 15 text and 10 multimodal retrieval stacks across four RAG pipelines with Auto-ARGUE and MiRAGE frameworks, reporting strong correlations between coverage-based retrieval metrics and nugget coverage at topic and system levels. The relationship is strongest when retrieval objectives align with generation goals, though iterative pipelines may partially decouple the two.

Significance. If the reported correlations are robust after appropriate statistical controls and stratification, the work supplies empirical grounding for treating retrieval effectiveness as an early indicator of RAG performance. This could streamline evaluation pipelines and clarify when retrieval quality directly translates to generation coverage, while highlighting the moderating role of pipeline complexity.

major comments (3)

[§5] §5 (Results): The headline claim of 'strong correlations' at topic and system levels is presented in aggregate without reported correlation coefficients, p-values, confidence intervals, or effect sizes. The abstract supplies no statistical details, and the absence of these quantities prevents assessment of whether the correlations are practically meaningful or driven by outliers.
[§5.3] §5.3 (Pipeline analysis): The manuscript notes that 'more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness' but does not quantify this decoupling via interaction terms, separate correlation tables by pipeline type, or moderation analysis. If the 4 pipelines include both simple and iterative variants without stratification, the aggregate coefficients may be inflated by aligned cases and fail to support the general proxy claim.
[§4.2] §4.2 (Experimental setup): No explicit rules for data exclusion, handling of multimodal vs. text differences, or controls for topic difficulty are described. This leaves open whether the observed correlations generalize or are confounded by benchmark-specific properties of TREC NeuCLIR, TREC RAG, and WikiVideo.

minor comments (2)

[Table 1] Table 1 and Figure 2: Axis labels and legend entries use inconsistent abbreviations for retrieval stacks; expand or standardize for readability.
[§2] Related work section: The discussion of prior RAG evaluation frameworks (e.g., ARGUE, MiRAGE) would benefit from explicit comparison of their nugget coverage definitions to avoid potential circularity in metric choice.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements in statistical reporting, pipeline stratification, and experimental controls.

read point-by-point responses

Referee: [§5] §5 (Results): The headline claim of 'strong correlations' at topic and system levels is presented in aggregate without reported correlation coefficients, p-values, confidence intervals, or effect sizes. The abstract supplies no statistical details, and the absence of these quantities prevents assessment of whether the correlations are practically meaningful or driven by outliers.

Authors: We agree that the current presentation lacks the necessary statistical details. In the revised manuscript we will report Pearson and Spearman correlation coefficients, associated p-values, 95% confidence intervals, and effect sizes (including r-squared) for both topic-level and system-level analyses. These statistics will be provided in aggregate as well as stratified by benchmark and pipeline type to allow assessment of practical significance and potential outlier effects. revision: yes
Referee: [§5.3] §5.3 (Pipeline analysis): The manuscript notes that 'more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness' but does not quantify this decoupling via interaction terms, separate correlation tables by pipeline type, or moderation analysis. If the 4 pipelines include both simple and iterative variants without stratification, the aggregate coefficients may be inflated by aligned cases and fail to support the general proxy claim.

Authors: We acknowledge the value of quantifying the decoupling effect. The revision will add separate correlation tables for simple versus iterative pipelines, include interaction terms in regression models to test moderation by pipeline complexity, and present a dedicated moderation analysis. This will clarify the conditions under which retrieval metrics remain reliable proxies and prevent over-generalization from aggregate results. revision: yes
Referee: [§4.2] §4.2 (Experimental setup): No explicit rules for data exclusion, handling of multimodal vs. text differences, or controls for topic difficulty are described. This leaves open whether the observed correlations generalize or are confounded by benchmark-specific properties of TREC NeuCLIR, TREC RAG, and WikiVideo.

Authors: We will expand §4.2 to specify data exclusion rules (e.g., minimum nugget count per topic), detail the separate processing and normalization steps for multimodal versus text benchmarks, and incorporate controls for topic difficulty through stratification by topic features and inclusion of topic as a random effect in statistical models. These additions will strengthen claims of generalizability across the three benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical correlation analysis on external benchmarks

full rationale

The paper reports measured correlations between retrieval coverage metrics and generated response nugget coverage across TREC NeuCLIR 2024, TREC RAG 2024, and WikiVideo benchmarks using 15 text + 10 multimodal stacks and four RAG pipelines. No equations, fitted parameters, or derivations are defined in terms of the target quantities; all results are direct statistical observations from independent evaluation frameworks (Auto-ARGUE, MiRAGE). The analysis contains no self-definitional steps, no predictions that reduce to fitted inputs, and no load-bearing self-citations that substitute for external verification. The central claim is therefore an empirical finding rather than a constructed equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard IR assumptions about coverage metrics and nugget-based evaluation; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Nugget coverage measured by Auto-ARGUE and MiRAGE is a valid proxy for information coverage in generated responses
Central to linking retrieval metrics with generation quality

pith-pipeline@v0.9.0 · 5489 in / 1168 out tokens · 48501 ms · 2026-05-15T13:06:52.804484+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 11 internal anchors

[1]

Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, and Mo- hammad Aliannejadi. 2025. Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets. arXiv:2503.09902 [cs.IR] https://arxiv.org/abs/2503.09902

work page arXiv 2025
[2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511 [cs.CL] https://arxiv.org/abs/2310.11511

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. InThe Twelfth International Conference on Learning Representations

work page 2024
[4]

deterministic

Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, et al. 2024. Non-determinism of" deterministic" llm settings.arXiv preprint arXiv:2408.04667(2024)

work page arXiv 2024
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

McKeown, and Michael Elhadad

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information Fusion in the Context of Multi-Document Summarization. InProceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, 550–557. https: //doi.org/10.3115/1034678.1034760

work page doi:10.3115/1034678.1034760 1999
[7]

Andrew Blair-Stanek and Benjamin Van Durme. 2025. Llms provide unstable answers to legal questions. InProceedings of the Twentieth International Conference on Artificial Intelligence and Law. 425–429

work page 2025
[8]

Carbonell and Jade Goldstein

Jaime G. Carbonell and Jade Goldstein. 2018. The Use of MMR and Diversity- Based Reranking in Document Reranking and Summarization. (6 2018). https: //doi.org/10.1184/R1/6610814.v1

work page doi:10.1184/r1/6610814.v1 2018
[9]

Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. arXiv:2404.00610 [cs.CL] https://arxiv.org/abs/2404.00610

work page arXiv 2024
[10]

Olivier Chapelle, Shihao Ji, Ciya Liao, Emre Velipasaoglu, Larry Lai, and Su- Lin Wu. 2011. Intent-based diversification of web search results: metrics and algorithms.Inf. Retr.14, 6 (Dec. 2011), 572–592

work page 2011
[11]

Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xRAG: Extreme Context Compression for Retrieval-Augmented Generation with One Token. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2024
[12]

Clarke, Nick Craswell, Ian Soboroff, and Azin Ashkan

Charles L.A. Clarke, Nick Craswell, Ian Soboroff, and Azin Ashkan. 2011. A comparative analysis of cascade measures for novelty and diversity. InProceedings of the Fourth ACM International Conference on Web Search and Data Mining(Hong Kong, China)(WSDM ’11). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/1935826.1935847

work page doi:10.1145/1935826.1935847 2011
[13]

Clarke, Maheedhar Kolla, Gordon V

Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. InProceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Singapore, Singapore)(SIGIR ’08). Association f...

work page doi:10.1145/1390334.1390446 2008
[14]

Cormack, Charles L A Clarke, and Stefan Buettcher

Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval(Boston, MA, USA)(SIGIR ’09). Association for Computing Machinery, New York, NY, USA, 758–759...

work page arXiv 2009
[15]

Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden, and James Mayfield. 2026. Incorporating Q&A Nuggets into Retrieval-Augmented Generation. InProceedings of the 48th European Conference on Information Retrieval (ECIR 2026)

work page 2026
[16]

Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. 2025. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. InFindings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carol...

work page doi:10.18653/v1/2025.findings-emnlp.1264 2025
[17]

Kevin Duh, Dawn Lawrie, Debashish Chakraborty, Roxana Petcu, Eugene Yang, Kenton Murraya, Daniel Khashabi, and Maxime Dassen. 2025. HLTCOE Genera- tion Team at TREC 2025. InThe Thirty-Fourth Text REtrieval Conference Proceed- ings (TREC2025). https://trec-ragtime.github.io/assets/notebooks/2025/hltcoe- gen.pdf

work page 2025
[18]

Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, and Dawn Lawrie. 2025. HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval.arXiv preprint arXiv:2506.22356(2025)

work page arXiv 2025
[19]

Assaf Elovic. 2023. gpt-researcher. https://github.com/assafelovic/gpt-researcher

work page 2023
[20]

Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. 2025. KiRAG: Knowledge- Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation.arXiv preprint arXiv:2502.18397(2025)

work page arXiv 2025
[21]

Naghmeh Farzi and Laura Dietz. 2024. An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments. arXiv:2402.00309 [cs.IR] https://arxiv. org/abs/2402.00309

work page arXiv 2024
[22]

Naghmeh Farzi and Laura Dietz. 2024. Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems. InProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval(Washington DC, USA)(ICTIR ’24). Association for Computing Machinery, New York, NY, USA, 175–184. https://doi.org/10.1145/3664190.3672511

work page doi:10.1145/3664190.3672511 2024
[23]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, and Taro Watanabe. 2025. Iterkey: Iterative keyword generation with llms for enhanced retrieval aug- mented generation. InProceedings of the Second Conference on Language Modeling (COLM’25)

work page 2025
[25]

de Vries

Gijs Hendriksen, Djoerd Hiemstra, and Arjen P. de Vries. 2025. Selective Search as a First-Stage Retriever. InExperimental IR Meets Multilinguality, Multimodality, and Interaction: 16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9–12, 2025, Proceedings(Madrid, Spain). Springer-Verlag, Berlin, Heidelberg, 17–33. h...

work page doi:10.1007/978-3-032-04354-2_2 2025
[26]

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.ACM Trans. Inf. Syst.20, 4 (Oct. 2002), 422–446. https://doi. org/10.1145/582415.582418

work page doi:10.1145/582415.582418 2002
[27]

Why Language Models Hallucinate

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL] https://arxiv.org/ abs/2509.04664

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 [cs.CL] https://arxiv.org/abs/2310.03714

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Gun Il Kim, Jong Wook Kim, and Beakcheol Jang. 2025. UniRAG: A Unified RAG Framework for Knowledge-Intensive Queries with Decomposition, Break- Down Reasoning, and Iterative Rewriting. InFindings of the Association for Computational Linguistics: EMNLP 2025. 18795–18810

work page 2025
[30]

Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Mar- tin, Ronald Colaianni, Nolan King, Eugene Yang, and Benjamin Van Durme. 2025. MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval. arXiv:2410.11619 [cs.CV] https://arxiv.org/abs/2410.11619

work page arXiv 2025
[31]

Weronika Lajewska and Krisztian Balog. 2025. GINGER: Grounded Information Nugget-Based Generation of Responses. InProceedings of the 48th International ACM SIGIR Conference (SIGIR ’25). https://krisztianbalog.com/files/sigir2025- ginger.pdf SIGIR 2025 paper

work page 2025
[32]

Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE.arXiv preprint arXiv:2403.06789(2024)

work page arXiv 2024
[33]

Victor Lavrenko and W Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51. ACM New York, NY, USA, 260–267

work page 2017
[34]

Oard, Luca Soldaini, and Eugene Yang

Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. 2025. Overview of the TREC 2024 NeuCLIR Track. arXiv:2509.14355 [cs.IR] https://arxiv.org/abs/2509.14355

work page arXiv 2025
[35]

Liu, Kevin Lin, John Hewitt, Bhargavi Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173. https://doi.org/10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[36]

Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, and Jimmy Lin. 2025. Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality. arXiv:2505.02466 [cs.IR] https://arxiv.org/abs/ 2505.02466 10

work page arXiv 2025
[37]

Sean MacAvaney, Craig Macdonald, and Iadh Ounis. 2021. Streamlining Evalua- tion with ir-measures. arXiv:2111.13466 [cs.IR] https://arxiv.org/abs/2111.13466

work page arXiv 2021
[38]

Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Rec- knor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme. 2025. WikiVideo: Article Generation from Multiple Videos. arXiv:2504.00939 [cs.CV] https: //arxiv.org/abs/2504.00939

work page arXiv 2025
[39]

Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, and Benjamin Van Durme. 2025. Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Genera- tion. arXiv:2510.24870 [cs.CL] https://arxiv.org/abs/2510.24870

work page arXiv 2025
[40]

Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler

James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler. 2024. On the Evaluation of Machine- Generated Reports. InProceedings of the 47th International ACM SIGIR Confer- ence on Research and Development in Information Retrieva...

work page doi:10.1145/3626772.3657846 2024
[41]

Teague McMillan, Gabriele Dominici, Martin Gjoreski, and Marc Langheinrich

work page
[42]

Towards Transparent Reasoning: What Drives Faithfulness in Large Lan- guage Models? arXiv:2510.24236 [cs.CL] https://arxiv.org/abs/2510.24236

work page arXiv
[43]

Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Bench- mark for Complex Answer Retrieval. arXiv:1705.04803 [cs.IR] https://arxiv.org/ abs/1705.04803

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Bench- mark for Complex Answer Retrieval. InProceedings of the ACM SIGIR International Conference on Theory of Information Retrieval(Amsterdam, The Netherlands) (ICTIR ’17). Association for Computing Machinery, New York, NY, USA, 293–296. https://doi.org/10.1145/3121050.3121099

work page doi:10.1145/3121050.3121099 2017
[45]

Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, and Andrew Yates. 2025. Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector. arXiv:2510.00671 [cs.IR] https://arxiv.org/abs/2510.00671

work page arXiv 2025
[46]

Paul Over. 2001. The TREC interactive track: an annotated bibliography.Inf. Process. Manage.37, 3 (May 2001), 369–381. https://doi.org/10.1016/S0306- 4573(00)00053-4

work page doi:10.1016/s0306- 2001
[47]

Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 [cs.IR] https://arxiv.org/abs/2406.16828

work page arXiv 2024
[48]

Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework. arXiv:2411.09607 [cs.IR] https://arxiv.org/abs/2411.09607

work page arXiv 2024
[49]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV] https://arxiv.org/ abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

de Melo, Benjamin Van Durme, and Rama Chellappa

Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Ken- ton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, and Rama Chellappa. 2025. Video-ColBERT: Contextualized Late Interaction for Text-to- Video Retrieval. arXiv:2503.19009 [cs.CV] https://arxiv.org/abs/2503.19009

work page arXiv 2025
[51]

Stephen Robertson. 2008. A new interpretation of average precision. InPro- ceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 689–690

work page 2008
[52]

Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and trends®in information retrieval 3, 4 (2009), 333–389

work page 2009
[53]

Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Oluwaseun Eisape, Tanner Spendlove, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Kenton Murray, and Reno Kriz. 2025. MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion. arXiv:2503.20698 [cs.CV] https://...

work page arXiv 2025
[54]

Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, and Reno Kriz. 2026. RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval. arXiv:2602.02444 [cs.IR] https://arxiv.org/abs/2602.02444

work page arXiv 2026
[55]

Ian Soboroff, Donna Harman, et al. 2003. Overview of the TREC 2003 Novelty Track.. InTREC. 38–53

work page 2003
[56]

Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2025. Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges. arXiv:2504.15205 [cs.CL] https://arxiv.org/abs/2504.15205

work page arXiv 2025
[57]

Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2025. Assessing Support for the TREC 2024 RAG Track: A Large-Scale Comparative Study of LLM and Human Evaluations. InProceedings of the 48th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Pad...

work page doi:10.1145/3726302.3730165 2025
[58]

William Walden, Orion Weller, Laura Dietz, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, and Eugene Yang. 2025. Auto-ARGUE: LLM-Based Report Generation Evaluation.arXiv preprint arXiv:2509.26184(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Ben- jamin Van Durme. 2025. Rank1: Test-Time Compute for Reranking in Information Retrieval. arXiv:2502.18418 [cs.IR] https://arxiv.org/abs/2502.18418

work page arXiv 2025
[60]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, and Yi Zhang. 2024. Im-rag: Multi-round retrieval-augmented generation through learning inner monologues. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 730–740

work page 2024
[63]

Diji Yang, Linda Zeng, Jinmeng Rao, and Yi Zhang. 2025. Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self- Practicing. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1305–1315

work page 2025
[64]

Oard, and Scott Miller

Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, and Scott Miller

work page
[65]

InProceedings of the 46th European Conference on Information Retrieval (ECIR)

Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation. InProceedings of the 46th European Conference on Information Retrieval (ECIR). https://arxiv.org/abs/2401.04810

work page arXiv
[66]

Eugene Yang, Dawn Lawrie, Orion Weller, and James Mayfield. 2025. HLTCOE at TREC 2024 NeuCLIR Track. arXiv:2510.00143 [cs.CL] https://arxiv.org/abs/ 2510.00143

work page arXiv 2025
[67]

Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, and Trevor Adri- aanse. 2026. RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation. arXiv:2601.10644 [cs.IR] https://arxiv.org/abs/2601.10644

work page arXiv 2026
[68]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

work page
[69]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL] https://arxiv.org/abs/2506.05176

work page internal anchor Pith review Pith/arXiv arXiv
[70]

Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, and Weidong Guo. 2021. QBSUM: A large-scale query-based document summarization dataset from real-world applications.Computer Speech & Language66 (March 2021), 101166. https://doi.org/10.1016/j.csl.2020.101166

work page doi:10.1016/j.csl.2020.101166 2021
[71]

Wei Zheng, Xuanhui Wang, Hui Fang, and Hong Cheng. 2012. Coverage-based search result diversification.Inf. Retr.15, 5 (Oct. 2012), 433–457. https://doi.org/ 10.1007/s10791-011-9178-4

work page doi:10.1007/s10791-011-9178-4 2012
[72]

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv:2310.01852 [cs.CV] https://arxiv.org/abs/2310.01852 11

work page arXiv 2024