Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead
Pith reviewed 2026-05-13 17:15 UTC · model grok-4.3
The pith
Some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas large LLM-based bi-encoders incur substantial latency for modest gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration.
What carries the argument
The reasoning overhead metric obtained by comparing accuracy gains from five reasoning-augmented query variants to the added latency, together with robustness checks using controlled perturbations and AUROC evaluation of confidence for success prediction.
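The reasoning-overhead metric described above can be sketched as accuracy gain per unit of added latency. This is a minimal, hypothetical rendering of that comparison; the function and field names are illustrative and not from the paper's released code, and the numbers in the example are made up.

```python
# Hypothetical sketch of the reasoning-overhead metric: accuracy gain from
# a reasoning-augmented query variant relative to the latency it adds.
# Field names ("ndcg", "latency") are assumptions, not the paper's schema.

def reasoning_overhead(standard, augmented):
    """Return (accuracy gain, added latency, gain per extra second).

    `standard` and `augmented` are dicts with nDCG@10 and per-query
    latency in seconds for the same retriever on the same task.
    """
    gain = augmented["ndcg"] - standard["ndcg"]
    added_latency = augmented["latency"] - standard["latency"]
    per_second = gain / added_latency if added_latency > 0 else float("inf")
    return gain, added_latency, per_second

# Example with invented numbers: a sub-1B encoder where augmentation is cheap,
# so even a modest gain yields a favorable gain-per-second ratio.
gain, cost, ratio = reasoning_overhead(
    {"ndcg": 0.312, "latency": 0.08},
    {"ndcg": 0.355, "latency": 0.09},
)
```

Under this framing, "diminishing returns for top retrievers" corresponds to the ratio shrinking as baseline effectiveness rises.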
If this is right
- Some specialized retrievers provide a better balance of effectiveness and throughput than large LLM-based models.
- Reasoning augmentation yields minimal latency cost for encoders below 1B parameters but limited additional benefit for top performers.
- Reasoning augmentation may decrease performance in formal math and code domains for certain retrievers.
- Raw retrieval scores show weak calibration for predicting query success across all tested model families.
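The calibration finding in the last bullet rests on an AUROC-style check: treat a query as a "success" if its retrieval quality clears a threshold, then measure how well raw scores separate successes from failures. A minimal stdlib sketch, with illustrative scores and labels (not the paper's data):

```python
# Minimal AUROC sketch for the calibration check: the probability that a
# randomly chosen successful query outscores a randomly chosen failed one
# (ties count half). All numbers below are invented for illustration.

def auroc(scores, labels):
    """AUROC of `scores` for predicting binary `labels` (1 = success)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both successes and failures")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Weak calibration looks like AUROC near 0.5: raw scores barely
# distinguish queries the retriever handled well from ones it failed.
scores = [0.91, 0.88, 0.90, 0.87, 0.89, 0.92]
labels = [1, 0, 0, 1, 0, 1]
print(auroc(scores, labels))
```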
Where Pith is reading between the lines
- Production retrieval systems could route queries to smaller specialized models for most cases and reserve larger ones for difficult queries to control costs.
- The diminishing returns on reasoning augmentation suggest that model scaling alone may not resolve efficiency issues for complex retrieval.
- Testing the same metrics on live user query logs would reveal whether the controlled variants capture real-world reasoning overhead.
- Improved calibration methods could allow confidence scores to guide hybrid retrieval strategies that mix efficient and expensive models.
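The routing idea in the first and last bullets above can be sketched as a confidence-gated cascade: answer with a cheap retriever when its confidence is high, escalate to an expensive LLM retriever otherwise. The retrievers, confidence function, and threshold below are stand-ins, not components from the paper, and the naive raw-score confidence shown is exactly what the calibration results warn against using without recalibration.

```python
# Hypothetical confidence-routed hybrid retrieval: a cascade that reserves
# the expensive model for queries the cheap model is unsure about.

def route(query, cheap, expensive, confidence, threshold=0.7):
    """Return (results, model_used) for one query."""
    results = cheap(query)
    if confidence(query, results) >= threshold:
        return results, "cheap"
    return expensive(query), "expensive"

# Toy usage with stub retrievers returning (doc_id, score) pairs.
cheap = lambda q: [("doc-a", 0.9)]
expensive = lambda q: [("doc-b", 0.95)]
confidence = lambda q, r: r[0][1]  # naive: trust the raw top score

print(route("easy query", cheap, expensive, confidence))
```

With well-calibrated confidence, most traffic stays on the cheap path and only genuinely hard queries pay the large model's latency.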
Load-bearing premise
The five provided reasoning-augmented query variants and the controlled perturbations used for robustness testing are representative of the reasoning overhead and robustness issues that would appear in actual user queries and production traffic.
What would settle it
Measuring the same efficiency, overhead, robustness, and calibration metrics on a collection of real user queries from a deployed search system to see if the patterns of latency costs, gains, and weak calibration hold outside the benchmark variants.
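The controlled perturbations at issue are small, systematic edits to a query that should not change its meaning. A sketch of the kind of perturbation generator involved; the specific edit types here are assumptions, not the paper's actual perturbation set:

```python
# Illustrative controlled query perturbations for robustness testing:
# meaning-preserving edits whose effect on retrieval can be measured.
# The edit kinds ("swap", "drop", "lower") are hypothetical examples.
import random

def perturb(query, kind, seed=0):
    rng = random.Random(seed)  # seeded, so perturbations are reproducible
    words = query.split()
    if kind == "swap":    # transpose two adjacent words
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    elif kind == "drop":  # delete one word
        del words[rng.randrange(len(words))]
    elif kind == "lower": # strip casing cues
        words = [w.lower() for w in words]
    return " ".join(words)

q = "Prove the Cauchy-Schwarz inequality for inner product spaces"
print(perturb(q, "lower"))
```

Comparing retrieval metrics on perturbed versus original queries then gives the robustness deltas the review refers to; running the same harness on organic query logs is what would settle the external-validity question.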
Original abstract
Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify reasoning overhead by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reproduces the BRIGHT benchmark across 12 tasks and 14 retrievers, extending evaluation to cold-start indexing cost, query latency distributions, throughput, corpus scaling, robustness under controlled perturbations, and AUROC-based confidence calibration for predicting success. It quantifies reasoning overhead by comparing standard queries to five provided reasoning-augmented variants, concluding that some reasoning-specialized retrievers deliver strong effectiveness at competitive throughput while large LLM bi-encoders incur high latency for modest gains; reasoning augmentation adds minimal latency for sub-1B models but shows diminishing returns and possible drops on math/code domains, with consistently weak confidence calibration across families. All code and artifacts are released.
Significance. If the empirical results hold, the work supplies a reproducible, multi-dimensional assessment of practical trade-offs for LLM retrievers on reasoning-intensive tasks, directly informing deployment decisions on efficiency versus effectiveness. The reproduction of BRIGHT, release of full artifacts, and orthogonal measurements (latency, robustness, calibration) constitute clear strengths that enable follow-on research.
Major comments (1)
- The reasoning-overhead claims (minimal added latency for sub-1B encoders, diminishing returns for top retrievers, and possible performance reductions on math/code domains) rest on direct comparison of standard queries versus the five fixed, benchmark-provided reasoning-augmented variants. These variants are not sampled from or validated against organic user queries or production traffic; if they differ systematically in length, complexity, or reasoning style, the reported latency distributions, throughput trade-offs, and domain-specific patterns may not generalize, weakening the practical takeaway on when augmentation is worth the cost.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the scope of our reasoning-overhead measurements. We address the point directly below and have incorporated a partial revision to clarify limitations.
Point-by-point responses
- Referee: The reasoning-overhead claims (minimal added latency for sub-1B encoders, diminishing returns for top retrievers, and possible performance reductions on math/code domains) rest on direct comparison of standard queries versus the five fixed, benchmark-provided reasoning-augmented variants. These variants are not sampled from or validated against organic user queries or production traffic; if they differ systematically in length, complexity, or reasoning style, the reported latency distributions, throughput trade-offs, and domain-specific patterns may not generalize, weakening the practical takeaway on when augmentation is worth the cost.
  Authors: We agree that the five reasoning-augmented variants are fixed artifacts supplied by the BRIGHT benchmark and were not sampled from or validated against organic user queries or production traffic. Consequently, systematic differences in length, complexity, or reasoning style could affect how well the observed latency distributions, throughput trade-offs, and domain-specific patterns (including possible drops on math/code) generalize to real deployment scenarios. Our analysis remains a controlled, reproducible comparison within the BRIGHT framework, which is the established benchmark for reasoning-intensive retrieval. To address the concern, we have added a new paragraph in the Discussion section that explicitly states this limitation and recommends future validation against production query logs. This revision preserves the value of the benchmark-based findings while acknowledging the boundary on external validity.
  Revision: partial
Circularity Check
No circularity: purely empirical benchmark reproduction and metric comparisons
Full rationale
The paper performs an empirical reproduction of the BRIGHT benchmark across 12 tasks and 14 retrievers, extending it with direct measurements of indexing cost, query latency, throughput, corpus scaling, robustness to controlled perturbations, and AUROC for confidence calibration. All reported findings (e.g., reasoning augmentation latency for sub-1B encoders, diminishing returns for top retrievers, domain-specific performance drops) are obtained by comparing standard queries against the five fixed benchmark-provided variants and standard metrics against released baselines. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations reduce any result to its own inputs by construction; the evaluation chain remains externally falsifiable via the public benchmark and code release.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: BRIGHT constitutes a representative benchmark for reasoning-intensive retrieval tasks.