LTRR: Learning To Rank Retrievers for LLMs

Fernando Diaz; To Eun Kim

arxiv: 2506.13743 · v2 · submitted 2025-06-16 · 💻 cs.CL · cs.IR

Pith reviewed 2026-05-19 09:06 UTC · model grok-4.3

keywords: query routing · learning to rank · retrieval augmented generation · RAG · retriever selection · question answering · LLM · ranking model

The pith

A model that ranks retrievers by their expected help for each query improves RAG accuracy over any single fixed retriever.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that retrieval-augmented generation benefits from choosing a retriever per query rather than always using the same one. It frames the choice as a learning-to-rank task in which the model predicts which retriever will lead to the correct final answer. Experiments across question-answering benchmarks with varied query types show that this routing approach beats the best single retriever. The improvements are clearest when the ranker is trained to maximize answer correctness and when it uses pairwise ranking, with XGBoost performing best. Routing also generalizes better to out-of-distribution queries than fixed setups.

Core claim

By treating retriever selection as a learning-to-rank problem, a model can be trained to order retrievers according to how much they are expected to improve the final answer correctness in a RAG pipeline, and using this ranking to pick the top one for each query yields higher performance than any static retriever across multiple benchmarks.
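
The selection step in this claim can be sketched in a few lines of Python. This is a minimal illustration, assuming a trained ranker has already produced one predicted-utility score per retriever for the query. The retriever names and the scores are hypothetical placeholders, not values from the paper.

```python
from typing import Dict, List


def rank_retrievers(scores: Dict[str, float]) -> List[str]:
    """Order retriever names from highest to lowest predicted utility."""
    # Sort by descending score; ties are broken by name so the order is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


def route(scores: Dict[str, float]) -> str:
    """Pick the single top-ranked retriever for this query."""
    return rank_retrievers(scores)[0]


# Hypothetical predicted utilities for one query (illustrative numbers only).
predicted = {"bm25": 0.41, "dense": 0.63, "dense_rerank": 0.58}
ranking = rank_retrievers(predicted)
chosen = route(predicted)
```

A static baseline corresponds to returning the same name for every query, so the router can only help on queries where its top choice differs from that fixed retriever.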

What carries the argument

The LTRR framework that learns to rank retrievers according to their expected contribution to downstream RAG performance using query features.

If this is right

Routing-based RAG consistently surpasses the strongest single-retriever baselines on diverse question-answering benchmarks.
Gains are particularly substantial when training with the Answer Correctness objective.
Pairwise ranking methods, with XGBoost yielding the best results, outperform other approaches.
The method shows stronger generalization to out-of-distribution queries.
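
To make the pairwise idea in the list above concrete, here is a small sketch under stated assumptions. It turns per-retriever utility labels (standing in for Answer Correctness) into "better versus worse" pairs and fits a linear scorer with a logistic pairwise loss. The linear model is swapped in for XGBoost to keep the example dependency-free, and the two-dimensional feature vectors are hypothetical, not the paper's features.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]  # (features of better retriever, features of worse one)


def make_pairs(features: Dict[str, Vector], utility: Dict[str, float]) -> List[Pair]:
    """Build preference pairs for one query from per-retriever utility labels."""
    pairs: List[Pair] = []
    for a in sorted(features):
        for b in sorted(features):
            if utility[a] > utility[b]:
                pairs.append((features[a], features[b]))
    return pairs


def train_pairwise(pairs: List[Pair], dim: int, epochs: int = 200, lr: float = 0.1) -> Vector:
    """Minimize log(1 + exp(-w.(x_better - x_worse))) by stochastic gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [p - q for p, q in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # The negative gradient of the loss is diff / (1 + exp(margin)).
            step = lr / (1.0 + math.exp(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


def score(w: Vector, x: Vector) -> float:
    """Linear ranking score for one (query, retriever) feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data: two queries, two retrievers, made-up features and utilities.
queries = [
    ({"bm25": [1.0, 0.0], "dense": [0.0, 1.0]}, {"bm25": 0.2, "dense": 0.9}),
    ({"bm25": [1.0, 0.2], "dense": [0.1, 0.8]}, {"bm25": 0.3, "dense": 0.7}),
]
pairs = [p for feats, util in queries for p in make_pairs(feats, util)]
weights = train_pairwise(pairs, dim=2)
```

At inference time the learned scorer is applied to each retriever's feature vector and the retrievers are sorted by score. Only the ordering matters, which is why the loss is defined on differences between pairs.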

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Query features alone may suffice to predict which retriever's strengths match the current question without running expensive inference first.
This routing idea could apply to selecting among different generation strategies or prompt formats in LLM systems.
Adding new retrievers to the pool might require only retraining the ranker rather than redesigning the whole system.

Load-bearing premise

The different retrievers have strengths that vary with query type in ways that a model can learn to predict from the query itself.

What would settle it

Running the routing system on a held-out set of queries where all retrievers perform equally or where the router picks poorly would show no improvement over the best baseline.

Original abstract

Retrieval-Augmented Generation (RAG) systems typically rely on a single fixed retriever, despite growing evidence that no single retriever performs optimally across all query types. In this paper, we explore a query routing approach that dynamically selects from a pool of retrievers based on the query, using both train-free heuristics and learned routing models. We frame routing as a learning-to-rank problem and introduce LTRR, a framework that Learns To Rank Retrievers according to their expected contribution to downstream RAG performance. Through experiments on diverse question-answering benchmarks with controlled variations in query types, we demonstrate that routing-based RAG consistently surpasses the strongest single-retriever baselines. The gains are particularly substantial when training with the Answer Correctness (AC) objective and when using pairwise ranking methods, with XGBoost yielding the best results. Additionally, our approach exhibits stronger generalization to out-of-distribution queries. Overall, our results underscore the critical role of both training strategy and optimization metric choice in effective query routing for RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Learned routing among retrievers using answer-correctness labels beats fixed retrievers in the experiments, with pairwise methods and XGBoost working best.

The main point is that routing queries to different retrievers with a model trained on how much each one helps the final answer can outperform sticking with the single strongest retriever across the board. They set this up as a learning-to-rank task where the target is downstream answer correctness rather than just retrieval scores. Experiments across several QA benchmarks with controlled query-type changes show consistent gains, especially when using pairwise ranking and the answer-correctness objective, and the routing approach also generalizes better to out-of-distribution queries than any fixed baseline. XGBoost came out ahead among the learned routers they tried.

This is a direct, practical extension of existing routing ideas from information retrieval. The experiments are set up to test when complementary strengths across retrievers can be predicted from query features, and the comparisons of training objectives and ranking methods are the most useful part of the work. The design with query variations helps isolate the conditions where routing adds value.

The soft spots are the missing details on effect sizes, statistical significance, and exact baseline construction. The abstract claims outperformance but gives no numbers, so the full paper has to show whether the gains are large enough to matter in practice or whether they shrink once you account for variance. If the retriever pool lacks real diversity or the router cannot reliably detect differences, the whole approach adds little.

This paper is for people building or studying RAG systems who already have access to multiple retrievers and want to make selection adaptive. A reader focused on practical retrieval improvements for LLMs would find the training-strategy comparisons worth looking at. I would send it to peer review because the method is clear, the empirical question is relevant, and the setup is reproducible enough to let referees check the claims.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LTRR, a learning-to-rank framework for dynamically selecting retrievers from a pool in RAG systems for LLMs. It evaluates both train-free heuristics and learned models (including XGBoost) on diverse QA benchmarks with controlled query-type variations, claiming that routing-based RAG consistently outperforms the strongest single-retriever baselines. Gains are reported as particularly large when training uses the Answer Correctness (AC) objective and pairwise ranking methods, with additional benefits in out-of-distribution generalization.

Significance. If the empirical results hold, the work provides evidence that query-adaptive retriever selection can improve RAG performance by exploiting complementary retriever strengths, with practical implications for choosing training objectives and ranking methods. The emphasis on generalization to OOD queries and the role of the AC objective adds value for RAG system design.

major comments (2)

[Abstract and §4 (Experiments)] The central claim of consistent outperformance over strongest single-retriever baselines is stated without reported effect sizes, confidence intervals, or statistical significance tests comparing routing variants to the best fixed baseline; this information is load-bearing for evaluating whether the gains are reliable and practically meaningful.
[§3 (Method) and §4.3 (Ablations)] The assumption that retriever performance differences are predictable from query features is central to the routing value proposition, yet no analysis of failure cases (e.g., when all retrievers perform similarly or when the router cannot detect differences) is provided to bound the conditions under which routing adds value.

minor comments (2)

[Abstract] Consider including one or two key quantitative results (e.g., average improvement or best-model delta) to make the performance claims more concrete for readers.
[Throughout] Notation: ensure consistent use of 'AC objective' versus full 'Answer Correctness' throughout the text and figures for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below, agreeing where revisions are needed to strengthen the empirical presentation and analysis of routing conditions.

Referee: [Abstract and §4 (Experiments)] The central claim of consistent outperformance over strongest single-retriever baselines is stated without reported effect sizes, confidence intervals, or statistical significance tests comparing routing variants to the best fixed baseline; this information is load-bearing for evaluating whether the gains are reliable and practically meaningful.

Authors: We agree that the absence of effect sizes, confidence intervals, and statistical significance tests limits the ability to assess the reliability of the reported gains. The current manuscript presents average performance improvements across benchmarks but does not include these quantitative details or formal tests against the strongest single-retriever baseline. In the revised version we will add effect sizes (e.g., absolute and relative improvements), standard deviations or confidence intervals where multiple runs are available, and paired statistical significance tests (such as Wilcoxon signed-rank or t-tests) for each routing variant versus the best fixed baseline. These additions will be placed in §4 and referenced in the abstract. revision: yes
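
As an illustration of the kind of paired comparison promised in this response, the sketch below computes the mean per-benchmark gain and an exact two-sided p-value from a paired sign-flip permutation test. This test is a stand-in chosen because it needs only the standard library; it is not the Wilcoxon or t-test named above. The scores are invented for the example and are not results from the paper.

```python
from itertools import product
from statistics import mean
from typing import List


def paired_sign_flip_test(router: List[float], baseline: List[float]) -> float:
    """Exact two-sided p-value for the mean paired difference.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign assignment is enumerated.
    """
    diffs = [r - b for r, b in zip(router, baseline)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical per-benchmark Answer Correctness scores (illustrative only).
router_ac = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73]
best_fixed_ac = [0.60, 0.64, 0.50, 0.78, 0.61, 0.70]

mean_gain = mean(r - b for r, b in zip(router_ac, best_fixed_ac))
p_value = paired_sign_flip_test(router_ac, best_fixed_ac)
```

Exhaustive enumeration costs 2^n evaluations, so this form is only practical for a small number of paired observations; larger samples would need random sign flips or one of the tests the authors mention.
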
Referee: [§3 (Method) and §4.3 (Ablations)] The assumption that retriever performance differences are predictable from query features is central to the routing value proposition, yet no analysis of failure cases (e.g., when all retrievers perform similarly or when the router cannot detect differences) is provided to bound the conditions under which routing adds value.

Authors: We acknowledge that an explicit analysis of failure cases is missing and would help bound the practical value of routing. While the experiments in §4.3 vary query types and examine OOD generalization, they do not directly quantify scenarios in which retriever performances are similar or where the router fails to detect meaningful differences. In the revision we will add a targeted discussion and supporting figures in §4.3 that measure per-query retriever score variance, identify queries where all retrievers yield comparable Answer Correctness, and report router accuracy and downstream impact in those regimes. This will clarify the conditions under which the predictability assumption holds. revision: yes
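
The per-query analysis described in this response could look roughly like the following sketch. It assumes a table of Answer Correctness values indexed by query and retriever. It reports the variance across retrievers, flags queries where all retrievers are within a tolerance of each other, and measures the headroom an oracle router would have over one fixed retriever. The tolerance, the retriever names, and the numbers are all hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def spread_report(ac: Dict[str, Dict[str, float]], fixed: str, tol: float = 0.05) -> List[dict]:
    """Summarize, per query, how much retriever choice can matter.

    ac[query][retriever] holds the answer-correctness score obtained when
    that retriever is used for that query.
    """
    rows = []
    for query, scores in ac.items():
        values = list(scores.values())
        best = max(values)
        rows.append({
            "query": query,
            "variance": pvariance(values),         # spread across retrievers
            "flat": best - min(values) <= tol,     # routing cannot help much here
            "headroom": best - scores[fixed],      # oracle gain over the fixed retriever
        })
    return rows


# Made-up scores for three queries and two retrievers.
ac_table = {
    "q1": {"bm25": 0.9, "dense": 0.3},
    "q2": {"bm25": 0.50, "dense": 0.52},
    "q3": {"bm25": 0.2, "dense": 0.8},
}
report = spread_report(ac_table, fixed="bm25")
flat_share = sum(row["flat"] for row in report) / len(report)
```

A high share of flat queries, or low average headroom, would indicate a regime where even a perfect router has little to gain over the fixed baseline.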

Circularity Check

0 steps flagged

No significant circularity: empirical comparison of routers vs. baselines

The paper is an empirical study that trains and evaluates learned routers (XGBoost, pairwise ranking, AC objective) against fixed single-retriever baselines on QA benchmarks with controlled query variations. No derivation chain, first-principles prediction, or self-citation is used to establish the central claim; reported gains come from direct experimental comparisons rather than quantities defined by the same fitted parameters. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions for ranking models plus the untested premise that retriever performance differences are predictable from query text alone.

axioms (1)

Domain assumption: retriever performance differences are learnable from query features
Implicit in the decision to train a router on query-retriever pairs.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 6.0

R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform sta...

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Anthropic. 2024. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol Accessed: 2025-05-23

[2] Jaime Arguello et al. 2017. Aggregated search. Foundations and Trends® in Information Retrieval 10, 5 (2017), 365–502

[3] Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Börschinger, and Tal Schuster. 2022. Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Comp... doi:10.18653/v1/2022.emnlp-main.20

[4] Jamie Callan. 2002. Distributed information retrieval. In Advances in information retrieval: recent research from the center for intelligent information retrieval. Springer, 127–150

[5] James P Callan, Zhihong Lu, and W Bruce Croft. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval. 21–28

[6] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. 129–136

[7] Hsinchun Chen, Haiyan Fan, Michael Chau, and Daniel Zeng. 2001. MetaSpider: Meta-searching and categorization on the Web. Journal of the American Society for Information Science and Technology 52, 13 (2001), 1134–1147

[8] Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025. ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv:2503.19470 [cs.AI] https://arxiv.org/abs/2503.19470

[9] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794

[10] Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 758–759

[11] Zhuyun Dai, Yubin Kim, and Jamie Callan. 2017. Learning to rank resources. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval. 837–840

[12] Fernando Diaz. 2005. Regularizing ad hoc retrieval scores. In Proceedings of the 14th ACM international conference on Information and knowledge management. 672–679

[13] Fernando Diaz. 2007. Performance prediction using spatial autocorrelation. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 583–590

[14] Fernando Diaz. 2007. Regularizing query-based retrieval scores. Information Retrieval 10 (2007), 531–562

[15] Fernando Diaz, Mounia Lalmas, and Milad Shokouhi. 2010. From federated to aggregated search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. 910–910

[16] Fernando Diaz, Bhaskar Mitra, Michael D. Ekstrand, Asia J. Biega, and Ben Carterette. 2020. Evaluating Stochastic Rankings with Expected Exposure. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM ’20). Association for Computing Machinery, 275–284

[17] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Nikolaos Aletras and Orphee De Clercq (Eds.). Association for Computational Linguisti...

[18] Simone Filice, Guy Horowitz, David Carmel, Zohar Karnin, Liane Lewin-Eytan, and Yoelle Maarek. 2025. Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana. arXiv:2501.12789 [cs.CL] https://arxiv.org/abs/2501.12789

[19] Eric J Glover, Steve Lawrence, William P Birmingham, and C Lee Giles. 1999. Architecture of a metasearch engine that supports user information needs. In Proceedings of the eighth international conference on Information and knowledge management. 210–216

[20] Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, and Martijn de Vos. 2025. Efficient Federated Search for Retrieval-Augmented Generation. In Proceedings of the 5th Workshop on Machine Learning and Systems (World Trade Center, Rotterdam, Netherlands) (EuroMLSys ’25). Association for Computing Machinery, New York, NY, US...

[21] Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)

[22] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park

[23] Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Li... doi:10.18653/v1/2024.naacl-long.389

[24] Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)

[25] Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 217–226

[26] Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang, and Guido Zuccon. 2023. Selecting which Dense Retriever to use for Zero-Shot Search. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Beijing, China) (SIGIR-AP ’23). Association for Computing... doi:10.1145/3624918.3625330

[27] Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, and Guido Zuccon. 2024. Leveraging LLMs for Unsupervised Dense Retriever Ranking. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1307–... doi:10.1145/3626772.3657798

[28] To Eun Kim and Fernando Diaz. 2025. Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation. arXiv:2409.11598 [cs.IR] https://arxiv.org/abs/2409.11598

[29] To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, and Hamed Zamani

[30] Retrieval-Enhanced Machine Learning: Synthesis and Opportunities. arXiv preprint arXiv:2407.12982 (2024)

[31] Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. 2024. Router-retriever: Exploring the benefits of routing over multiple expert embedding models. arXiv preprint arXiv:2409.02685 (2024)

[32] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

[33] Zhiling Luo, Xiaorong Shi, Xuanrui Lin, and Jinyang Gao. 2025. Evaluation Report on MCP Servers. arXiv preprint arXiv:2504.11094 (2025)

[34] Feiteng Mu, Yong Jiang, Liwen Zhang, Liuchu Liuchu, Wenjie Li, Pengjun Xie, and Fei Huang. 2024. Query Routing for Homogeneous Tools: An Instantiation in the RAG Scenario. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Flor...

[35] Harrie Oosterhuis. 2021. Computationally efficient optimization of plackett-luce ranking models for relevance and fairness. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1023–1032

[36] Ermelinda Oro, Francesco Maria Granata, Antonio Lanza, Amir Bachir, Luca De Grandis, and Massimo Ruffolo. 2024. Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models. (2024)

[37] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. 2025. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems 37 (2025), 30811–30849

[38] Joseph John Rocchio Jr. 1971. Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing (1971)

[39] Alireza Salemi and Hamed Zamani. 2024. Evaluating retrieval quality in retrieval-augmented generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2395–2400

[40] Xiaqiang Tang, Jian Li, Nan Du, and Sihong Xie. 2024. Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs. arXiv preprint arXiv:2412.07618 (2024)

[41] Falcon-LLM Team. 2024. The Falcon 3 Family of Open Models. https://huggingface.co/blog/falcon3

[42] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533 [cs.CL] https://arxiv.org/abs/2212.03533

[43] Xiao Wang, Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. 2023. ColBERT-PRF: Semantic pseudo-relevance feedback for dense passage and document retrieval. ACM Transactions on the Web 17, 1 (2023), 1–39

[44] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval 13 (2010), 254–270

SIGIR’25, July 2025, Padua, Italy. To Eun Kim and Fernando Diaz

A Reranking Methods. Query-Time Score Regularization. Based on the cluster hypothesis in IR, query-time score regularization ...

[1] [1]

Anthropic. 2024. Introducing the Model Context Protocol. https://www. anthropic.com/news/model-context-protocol Accessed: 2025-05-23

work page 2024

[2] [2]

Jaime Arguello et al . 2017. Aggregated search. Foundations and Trends ® in Information Retrieval 10, 5 (2017), 365–502

work page 2017

[3] [3]

Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Börschinger, and Tal Schuster. 2022. Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In Proceedings of the 2022 Conference on Empiri- cal Methods in Natural Language Processing , Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Comp...

work page doi:10.18653/v1/2022.emnlp-main.20 2022

[4] [4]

Jamie Callan. 2002. Distributed information retrieval. In Advances in informa- tion retrieval: recent research from the center for intelligent information retrieval . Springer, 127–150

work page 2002

[5] [5]

James P Callan, Zhihong Lu, and W Bruce Croft. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th annual interna- tional ACM SIGIR conference on Research and development in information retrieval . 21–28

work page 1995

[6] [6]

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning . 129–136

work page 2007

[7] [7]

Hsinchun Chen, Haiyan Fan, Michael Chau, and Daniel Zeng. 2001. MetaSpider: Meta-searching and categorization on the Web. Journal of the American Society for Information Science and Technology 52, 13 (2001), 1134–1147

work page 2001

[8] [8]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025. ReSearch: Learning to Reason with Search for LLMs via Reinforce- ment Learning. arXiv:2503.19470 [cs.AI] https://arxiv.org/abs/2503.19470

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining . 785–794

work page 2016

[10] [10]

Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 758–759

work page 2009

[11] [11]

Zhuyun Dai, Yubin Kim, and Jamie Callan. 2017. Learning to rank resources. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval. 837–840

work page 2017

[12] [12]

Fernando Diaz. 2005. Regularizing ad hoc retrieval scores. In Proceedings of the 14th ACM international conference on Information and knowledge management . 672–679

work page 2005

[13] [13]

Fernando Diaz. 2007. Performance prediction using spatial autocorrelation. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval . 583–590

work page 2007

[14] [14]

Fernando Diaz. 2007. Regularizing query-based retrieval scores. Information Retrieval 10 (2007), 531–562

work page 2007

[15] [15]

Fernando Diaz, Mounia Lalmas, and Milad Shokouhi. 2010. From federated to aggregated search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval . 910–910

work page 2010

[16] [16]

Ekstrand, Asia J

Fernando Diaz, Bhaskar Mitra, Michael D. Ekstrand, Asia J. Biega, and Ben Carterette. 2020. Evaluating Stochastic Rankings with Expected Exposure. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM ’20). Association for Computing Machinery, 275–284

work page 2020

[17] [17]

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Nikolaos Aletras and Orphee De Clercq (Eds.). Association for Computational Linguisti...

work page 2024

[18] [18]

Simone Filice, Guy Horowitz, David Carmel, Zohar Karnin, Liane Lewin-Eytan, and Yoelle Maarek. 2025. Generating Diverse Q&A Benchmarks for RAG Evalu- ation with DataMorgana. arXiv:2501.12789 [cs.CL] https://arxiv.org/abs/2501. 12789

work page arXiv 2025

[19] [19]

Eric J Glover, Steve Lawrence, William P Birmingham, and C Lee Giles. 1999. Architecture of a metasearch engine that supports user information needs. In Proceedings of the eighth international conference on Information and knowledge management. 210–216

work page 1999

[20] [20]

Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, and Martijn de Vos. 2025. Efficient Federated Search for Retrieval- Augmented Generation. In Proceedings of the 5th Workshop on Machine Learning and Systems (World Trade Center, Rotterdam, Netherlands)(EuroMLSys ’25). As- sociation for Computing Machinery, New York, NY, US...

work page arXiv 2025

[21] [21]

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[22] [22]

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park

work page

[23] [23]

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Li...

work page doi:10.18653/v1/2024.naacl-long.389 2024

[24] [24]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 217–226

work page 2006

[26] [26]

Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang, and Guido Zuccon. 2023. Selecting which Dense Retriever to use for Zero-Shot Search. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (Beijing, China) (SIGIR-AP ’23). Association for Computing...

work page doi:10.1145/3624918.3625330 2023

[27] [27]

Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, and Guido Zuccon. 2024. Leveraging LLMs for Unsupervised Dense Retriever Rank- ing. In Proceedings of the 47th International ACM SIGIR Conference on Re- search and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1307–...

doi:10.1145/3626772.3657798

[28]

To Eun Kim and Fernando Diaz. 2025. Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation. arXiv:2409.11598 [cs.IR] https://arxiv.org/abs/2409.11598

[29]

To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, and Hamed Zamani. 2024. Retrieval-Enhanced Machine Learning: Synthesis and Opportunities. arXiv preprint arXiv:2407.12982 (2024)

[31]

Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. 2024. Routerretriever: Exploring the benefits of routing over multiple expert embedding models. arXiv preprint arXiv:2409.02685 (2024)

[32]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

[33]

Zhiling Luo, Xiaorong Shi, Xuanrui Lin, and Jinyang Gao. 2025. Evaluation Report on MCP Servers. arXiv preprint arXiv:2504.11094 (2025)

[34]

Feiteng Mu, Yong Jiang, Liwen Zhang, Liuchu Liuchu, Wenjie Li, Pengjun Xie, and Fei Huang. 2024. Query Routing for Homogeneous Tools: An Instantiation in the RAG Scenario. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Flor...

[35]

Harrie Oosterhuis. 2021. Computationally efficient optimization of plackett-luce ranking models for relevance and fairness. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1023–1032

[36]

Ermelinda Oro, Francesco Maria Granata, Antonio Lanza, Amir Bachir, Luca De Grandis, and Massimo Ruffolo. 2024. Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models. (2024)

[37]

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. 2025. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems 37 (2025), 30811–30849

[38]

Joseph John Rocchio Jr. 1971. Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing (1971)

[39]

Alireza Salemi and Hamed Zamani. 2024. Evaluating retrieval quality in retrieval-augmented generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2395–2400

[40]

Xiaqiang Tang, Jian Li, Nan Du, and Sihong Xie. 2024. Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs. arXiv preprint arXiv:2412.07618 (2024)

[41]

Falcon-LLM Team. 2024. The Falcon 3 Family of Open Models. https://huggingface.co/blog/falcon3

[42]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533 [cs.CL] https://arxiv.org/abs/2212.03533

[43]

Xiao Wang, Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. 2023. ColBERT-PRF: Semantic pseudo-relevance feedback for dense passage and document retrieval. ACM Transactions on the Web 17, 1 (2023), 1–39

[44]

Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval 13 (2010), 254–270.

SIGIR’25, July 2025, Padua, Italy. To Eun Kim and Fernando Diaz

A Reranking Methods

Query-Time Score Regularization. Based on the cluster hypothesis in IR, query-time score regularization ...
