STORM: Stepwise Token Optimization with Reward-Guided Beam Search

Arthur Satouf; Benjamin Piwowarski; Giulio D'Erasmo; Habiboulaye Amadou Boubacar; Pablo Piantanida; Yuxuan Zong

arxiv: 2606.10621 · v1 · pith:QY5J6ZEFnew · submitted 2026-06-09 · 💻 cs.IR · cs.AI

STORM: Stepwise Token Optimization with Reward-Guided Beam Search

Arthur Satouf , Giulio D'Erasmo , Yuxuan Zong , Habiboulaye Amadou Boubacar , Pablo Piantanida , Benjamin Piwowarski This is my paper

Pith reviewed 2026-06-27 11:44 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords query expansionlexical retrievalbeam searchreward-guided generationself-supervised learninginformation retrievalquery rewritingBM25

0 comments

The pith

STORM trains query rewriters with token-level retrieval rewards so small models match larger ones while retaining BM25 speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that guiding query expansion generation with stepwise retrieval rewards allows smaller language models to produce effective lexical expansions. By pruning low-scoring partial sequences against an existing BM25 index, the method converts a delayed retrieval signal into per-token feedback. This matters because it enables competitive performance on retrieval tasks without the computational overhead of dense indexes or the need for very large models. Readers would care if it means maintaining the efficiency and transparency of traditional search while improving its effectiveness through learned expansions.

Core claim

The central discovery is a training procedure for lexical query rewriters in which beam search is guided by BM25 retrieval scores at each step, pruning continuations that lead to poor final retrieval and thereby supplying a token-level optimization signal that improves the quality of the generated expansions.

What carries the argument

Reward-guided beam search that prunes low-reward partial query expansions using BM25 scores to generate token-level supervision signals.

If this is right

Models from 0.6B to 8B parameters achieve retrieval performance that matches or exceeds competitive LLM rewriters.
The expanded queries retrieve at the speed of standard BM25 without requiring new indexes.
The trained rewriters transfer zero-shot to 18 languages and outperform dedicated multilingual dense retrievers on average.
At the 8B scale the approach rivals performance of far larger proprietary rewriters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pruning approach could be tested on other sequence-level metrics beyond retrieval to supply early feedback in generation tasks.
If the token-level signal works, it reduces the infrastructure cost of switching from lexical to dense retrieval for many collections.
The method might extend to training on multiple retrieval metrics simultaneously to balance different aspects of effectiveness.

Load-bearing premise

The method assumes that using BM25 scores to prune partial expansions during beam search provides an effective and unbiased signal for improving the model's query rewriting at the token level.

What would settle it

A direct comparison showing that models trained with STORM do not outperform prompted baselines on retrieval metrics like nDCG on TREC DL or BEIR datasets would falsify the effectiveness of the token-level signal.

Figures

Figures reproduced from arXiv: 2606.10621 by Arthur Satouf, Benjamin Piwowarski, Giulio D'Erasmo, Habiboulaye Amadou Boubacar, Pablo Piantanida, Yuxuan Zong.

**Figure 1.** Figure 1: Overview of the STORM training loop. A Bernoulli(ε) coin flip routes each query to either rewardguided beam search, where partial expansions are scored against the inverted index and low-reward branches are pruned, or nucleus sampling (classic generation). The selected sequence is then scored and importanceweighted to update the policy θ. 3.1 Task Formulation Let VLLM denote the LLM token vocabulary and … view at source ↗

**Figure 2.** Figure 2: Latency–effectiveness comparison on TREC DL’20. The x-axis shows total per-query latency (generation [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

read the original abstract

Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful-terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STORM uses BM25-guided beam pruning to turn retrieval scores into token-level signals for training query expanders, which is a practical framing but the abstract gives no numbers or ablations to check if it actually works.

read the letter

The main contribution is a training loop for LLM-based query rewriters that runs beam search over partial expansions, scores each continuation directly against the BM25 index, and prunes low-reward branches. This converts the final retrieval metric into per-token feedback instead of waiting for a complete sequence reward.

The approach keeps the final retriever as ordinary BM25, so no new index or encoding step is needed when the rewriter changes. The abstract claims that 0.6B–8B models trained this way match or beat other LLM rewriters on TREC DL and BEIR, rival larger proprietary systems at the high end, and transfer zero-shot to 18 languages on MIRACL while beating dedicated multilingual dense retrievers on average. That infrastructure-light property is the part that would matter in production.

The soft spot is that the abstract supplies zero experimental detail—no exact metrics, no baseline list, no beam sizes, no ablations on the pruning rule, and no variance numbers. Without those, it is impossible to tell whether the reported gains come from the stepwise signal or from other choices. The assumption that BM25 pruning gives an unbiased token-level training signal also needs checking; it could simply reinforce whatever BM25 already likes.

This is aimed at IR groups that want better lexical expansion without moving to dense indexes. A reader working on efficient or multilingual retrieval would get the most from it. The idea is clear enough and the practical motivation is real, so it should go to peer review even if the experiments need substantial tightening.

Referee Report

1 major / 0 minor

Summary. The paper introduces STORM, a self-supervised framework for lexical query expansion that trains LLM rewriters via reward-guided beam search. At each generation step, candidate token expansions are scored against a BM25 index and low-reward continuations are pruned, converting the retrieval reward into a token-level training signal. The method is evaluated on TREC DL and BEIR, claiming that 0.6B–8B backbones match or surpass competitive LLM rewriters while retaining BM25 retrieval speed; at 8B it rivals larger proprietary rewriters. It further reports zero-shot transfer to 18 languages on MIRACL, outperforming dedicated multilingual dense retrievers on average.

Significance. If the results hold, STORM provides an infrastructure-light alternative to dense retrieval by improving lexical methods on an unchanging inverted index. The core technical idea—turning delayed sequence-level retrieval rewards into stepwise token-level supervision via BM25-guided pruning—addresses a recognized limitation of standard RL or supervised fine-tuning for query rewriting and could influence future work on efficient, transparent retrieval.

major comments (1)

[Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.

Authors: We agree that the abstract is concise and omits specific numerical results, baseline names, and statistical details, which are standard limitations of abstracts but can reduce immediate assessability. The full manuscript (Sections 4–5) details all experimental setups, baselines (including BM25, prompted LLMs, and dense retrievers), controls, and significance testing. To address the concern directly, we will revise the abstract to incorporate key quantitative highlights (e.g., nDCG@10 gains on TREC DL and average MIRACL performance) and explicit baseline references while preserving length constraints. This change will be reflected in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces STORM as a self-supervised framework that generates token-level supervision for query rewriting by pruning beam-search expansions against an external BM25 index and retrieval metrics. This reward signal is drawn from a fixed, standard lexical retriever whose index is independent of the trained model; final claims are validated on public benchmarks (TREC DL, BEIR, MIRACL) with no reported parameter fitting that is then relabeled as a prediction, no self-definitional equations, and no load-bearing self-citations. The derivation chain therefore remains self-contained against external retrieval performance rather than reducing to its own fitted inputs or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information is given on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5796 in / 1312 out tokens · 37455 ms · 2026-06-27T11:44:36.786544+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 16 canonical work pages

[1]

A Reproducibility Study of LLM-Based Query Reformulation , doi =

Bigdeli, Amin and Hamidi Rad, Radin and Le, Hai and Incesu, Mert and Arabzadeh, Negar and Clarke, Charles and Bagheri, Ebrahim , year =. A Reproducibility Study of LLM-Based Query Reformulation , doi =
[2]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[3]

2019 , eprint=

Document Expansion by Query Prediction , author=. 2019 , eprint=

2019
[4]

Publications Manual , year = "1983", publisher =

1983
[5]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[6]

2026 , eprint=

QueStER: Query Specification for Generative keyword-based Retrieval , author=. 2026 , eprint=

2026
[7]

2022 , eprint=

Generative Cooperative Networks for Natural Language Generation , author=. 2022 , eprint=

2022
[8]

Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval

Satouf, Arthur and Zong, Yuxuan and Boubacar, Habiboulaye Amadou and Piantanida, Pablo and Piwowarski, Benjamin. Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.312

work page doi:10.18653/v1/2026.findings-eacl.312 2026
[9]

2023 , eprint=

Generative Query Reformulation for Effective Adhoc Search , author=. 2023 , eprint=

2023
[10]

2023 , eprint=

Query Expansion by Prompting Large Language Models , author=. 2023 , eprint=

2023
[11]

Bruce , title =

Lavrenko, Victor and Croft, W. Bruce , title =. 2001 , isbn =. doi:10.1145/383952.383972 , booktitle =

work page doi:10.1145/383952.383972 2001
[12]

2019 , issue_date =

Azad, Hiteshwar Kumar and Deepak, Akshay , title =. 2019 , issue_date =. doi:10.1016/j.ipm.2019.05.009 , journal =

work page doi:10.1016/j.ipm.2019.05.009 2019
[13]

2011 , issue_date =

Fontoura, Marcus and Josifovski, Vanja and Liu, Jinhui and Venkatesan, Srihari and Zhu, Xiangfei and Zien, Jason , title =. 2011 , issue_date =. doi:10.14778/3402755.3402756 , journal =

work page doi:10.14778/3402755.3402756 2011
[14]

2020 , eprint=

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT , author=. 2020 , eprint=

2020
[15]

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

work page doi:10.18653/v1/2020.emnlp-main.550 2020
[16]

Zhang, Xinyu and Ma, Xueguang and Shi, Peng and Lin, Jimmy. Mr. T y D i: A Multi-lingual Benchmark for Dense Retrieval. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.12

work page doi:10.18653/v1/2021.mrl-1.12 2021
[17]

Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever

Shen, Tao and Long, Guodong and Geng, Xiubo and Tao, Chongyang and Lei, Yibin and Zhou, Tianyi and Blumenstein, Michael and Jiang, Daxin. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.943

work page doi:10.18653/v1/2024.findings-acl.943 2024
[18]

2022 , eprint=

Unsupervised Dense Information Retrieval with Contrastive Learning , author=. 2022 , eprint=

2022
[19]

2022 , eprint=

mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset , author=. 2022 , eprint=

2022
[20]

Corpus-Steered Query Expansion with Large Language Models

Lei, Yibin and Cao, Yu and Zhou, Tianyi and Shen, Tao and Yates, Andrew. Corpus-Steered Query Expansion with Large Language Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.eacl-short.34

work page doi:10.18653/v1/2024.eacl-short.34 2024
[21]

2023 , eprint=

Query2doc: Query Expansion with Large Language Models , author=. 2023 , eprint=

2023
[22]

and Agichtein, Eugene , year=

Dhole, Kaustubh D. and Agichtein, Eugene , year=. GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation , ISBN=. doi:10.1007/978-3-031-56063-7_24 , booktitle=

work page doi:10.1007/978-3-031-56063-7_24
[23]

2024 , eprint=

Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers , author=. 2024 , eprint=

2024
[24]

2021 , eprint=

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations , author=. 2021 , eprint=

2021
[25]

Text Retrieval Conference , year=

Okapi at TREC-3 , author=. Text Retrieval Conference , year=
[26]

Pretrained Transformers for Text Ranking: BERT and Beyond

Yates, Andrew and Nogueira, Rodrigo and Lin, Jimmy. Pretrained Transformers for Text Ranking: BERT and Beyond. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials. 2021. doi:10.18653/v1/2021.naacl-tutorials.1

work page doi:10.18653/v1/2021.naacl-tutorials.1 2021
[27]

A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering

Vakulenko, Svitlana and Longpre, Shayne and Tu, Zhucheng and Anantha, Raviteja. A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering. Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI). 2020. doi:10.18653/v1/2020.scai-1.2

work page doi:10.18653/v1/2020.scai-1.2 2020
[28]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[29]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[30]

Dan Gusfield , title =. 1997

1997
[31]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[32]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[33]

Precise Zero‑Shot Dense Retrieval without Relevance Labels , author =
[34]

Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever , author =
[35]

Query Rewriting for Retrieval-Augmented Large Language Models , author =. emnlp
[36]

RaFe: Ranking Feedback Improves Query Rewriting for RAG , author =
[37]

2020 , organization=

Dense Passage Retrieval for Open-Domain Question Answering , author=. 2020 , organization=

2020
[38]

arXiv preprint arXiv:1901.04085 , year=

Passage Re-ranking with BERT , author=. arXiv preprint arXiv:1901.04085 , year=

Pith/arXiv arXiv 1901
[39]

arXiv preprint arXiv:2109.10086 , year=

SPLADE v2: Sparse lexical and expansion model for information retrieval , author=. arXiv preprint arXiv:2109.10086 , year=

arXiv
[40]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=
[41]

arXiv preprint arXiv:2404.00610 , year=

Rq-rag: Learning to refine queries for retrieval augmented generation , author=. arXiv preprint arXiv:2404.00610 , year=

arXiv
[42]

Grand, Adrien and Muir, Robert and Ferenczi, Jim and Lin, Jimmy , editor =. From. Advances in. 2020 , keywords =. doi:10.1007/978-3-030-45442-5_3 , language =

work page doi:10.1007/978-3-030-45442-5_3 2020
[43]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=
[44]

FineTuning LLaMA for Multi Stage Text Retrieval , author =
[45]

2001 , organization=

Relevance-based language models , author=. 2001 , organization=

2001
[46]

2004 , organization=

UMass at TREC 2004: Novelty and HARD , author=. 2004 , organization=

2004
[47]

1998 , isbn =

Mitra, Mandar and Singhal, Amit and Buckley, Chris , title =. 1998 , isbn =. doi:10.1145/290941.290995 , booktitle =

work page doi:10.1145/290941.290995 1998
[48]

Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong , journal =
[49]

2021 , publisher =

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Lin, Jimmy , booktitle =. 2021 , publisher =

2021
[50]

, booktitle =

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Voorhees, Ellen M. , booktitle =. Overview of the. 2020 , publisher =

2020
[51]

Overview of the

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel , booktitle =. Overview of the. 2021 , publisher =

2021
[52]

2021 , eprint=

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. 2021 , eprint=

2021
[53]

2023 , doi =

Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy , journal =. 2023 , doi =

2023
[54]

MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy. MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00595

work page doi:10.1162/tacl_a_00595 2023
[55]

2018 , eprint=

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , author=. 2018 , eprint=

2018
[56]

A Reproducibility Study of LLM-Based Query Reformulation , author=. -. 2026 , url=

2026
[57]

W ord2 P assage: Word-level Importance Re-weighting for Query Expansion

Choi, Jeonghwan and Ban, Minjeong and Kim, Minseok and Song, Hwanjun. W ord2 P assage: Word-level Importance Re-weighting for Query Expansion. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.434

work page doi:10.18653/v1/2025.findings-acl.434 2025
[58]

Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study , author =
[59]

Exploring the best practices of query expansion with large language models , author=

[1] [1]

A Reproducibility Study of LLM-Based Query Reformulation , doi =

Bigdeli, Amin and Hamidi Rad, Radin and Le, Hai and Incesu, Mert and Arabzadeh, Negar and Clarke, Charles and Bagheri, Ebrahim , year =. A Reproducibility Study of LLM-Based Query Reformulation , doi =

[2] [2]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[3] [3]

2019 , eprint=

Document Expansion by Query Prediction , author=. 2019 , eprint=

2019

[4] [4]

Publications Manual , year = "1983", publisher =

1983

[5] [5]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[6] [6]

2026 , eprint=

QueStER: Query Specification for Generative keyword-based Retrieval , author=. 2026 , eprint=

2026

[7] [7]

2022 , eprint=

Generative Cooperative Networks for Natural Language Generation , author=. 2022 , eprint=

2022

[8] [8]

Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval

Satouf, Arthur and Zong, Yuxuan and Boubacar, Habiboulaye Amadou and Piantanida, Pablo and Piwowarski, Benjamin. Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.312

work page doi:10.18653/v1/2026.findings-eacl.312 2026

[9] [9]

2023 , eprint=

Generative Query Reformulation for Effective Adhoc Search , author=. 2023 , eprint=

2023

[10] [10]

2023 , eprint=

Query Expansion by Prompting Large Language Models , author=. 2023 , eprint=

2023

[11] [11]

Bruce , title =

Lavrenko, Victor and Croft, W. Bruce , title =. 2001 , isbn =. doi:10.1145/383952.383972 , booktitle =

work page doi:10.1145/383952.383972 2001

[12] [12]

2019 , issue_date =

Azad, Hiteshwar Kumar and Deepak, Akshay , title =. 2019 , issue_date =. doi:10.1016/j.ipm.2019.05.009 , journal =

work page doi:10.1016/j.ipm.2019.05.009 2019

[13] [13]

2011 , issue_date =

Fontoura, Marcus and Josifovski, Vanja and Liu, Jinhui and Venkatesan, Srihari and Zhu, Xiangfei and Zien, Jason , title =. 2011 , issue_date =. doi:10.14778/3402755.3402756 , journal =

work page doi:10.14778/3402755.3402756 2011

[14] [14]

2020 , eprint=

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT , author=. 2020 , eprint=

2020

[15] [15]

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

work page doi:10.18653/v1/2020.emnlp-main.550 2020

[16] [16]

Zhang, Xinyu and Ma, Xueguang and Shi, Peng and Lin, Jimmy. Mr. T y D i: A Multi-lingual Benchmark for Dense Retrieval. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.12

work page doi:10.18653/v1/2021.mrl-1.12 2021

[17] [17]

Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever

Shen, Tao and Long, Guodong and Geng, Xiubo and Tao, Chongyang and Lei, Yibin and Zhou, Tianyi and Blumenstein, Michael and Jiang, Daxin. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.943

work page doi:10.18653/v1/2024.findings-acl.943 2024

[18] [18]

2022 , eprint=

Unsupervised Dense Information Retrieval with Contrastive Learning , author=. 2022 , eprint=

2022

[19] [19]

2022 , eprint=

mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset , author=. 2022 , eprint=

2022

[20] [20]

Corpus-Steered Query Expansion with Large Language Models

Lei, Yibin and Cao, Yu and Zhou, Tianyi and Shen, Tao and Yates, Andrew. Corpus-Steered Query Expansion with Large Language Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.eacl-short.34

work page doi:10.18653/v1/2024.eacl-short.34 2024

[21] [21]

2023 , eprint=

Query2doc: Query Expansion with Large Language Models , author=. 2023 , eprint=

2023

[22] [22]

and Agichtein, Eugene , year=

Dhole, Kaustubh D. and Agichtein, Eugene , year=. GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation , ISBN=. doi:10.1007/978-3-031-56063-7_24 , booktitle=

work page doi:10.1007/978-3-031-56063-7_24

[23] [23]

2024 , eprint=

Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers , author=. 2024 , eprint=

2024

[24] [24]

2021 , eprint=

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations , author=. 2021 , eprint=

2021

[25] [25]

Text Retrieval Conference , year=

Okapi at TREC-3 , author=. Text Retrieval Conference , year=

[26] [26]

Pretrained Transformers for Text Ranking: BERT and Beyond

Yates, Andrew and Nogueira, Rodrigo and Lin, Jimmy. Pretrained Transformers for Text Ranking: BERT and Beyond. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials. 2021. doi:10.18653/v1/2021.naacl-tutorials.1

work page doi:10.18653/v1/2021.naacl-tutorials.1 2021

[27] [27]

A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering

Vakulenko, Svitlana and Longpre, Shayne and Tu, Zhucheng and Anantha, Raviteja. A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering. Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI). 2020. doi:10.18653/v1/2020.scai-1.2

work page doi:10.18653/v1/2020.scai-1.2 2020

[28] [28]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[29] [29]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[30] [30]

Dan Gusfield , title =. 1997

1997

[31] [31]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[32] [32]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[33] [33]

Precise Zero‑Shot Dense Retrieval without Relevance Labels , author =

[34] [34]

Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever , author =

[35] [35]

Query Rewriting for Retrieval-Augmented Large Language Models , author =. emnlp

[36] [36]

RaFe: Ranking Feedback Improves Query Rewriting for RAG , author =

[37] [37]

2020 , organization=

Dense Passage Retrieval for Open-Domain Question Answering , author=. 2020 , organization=

2020

[38] [38]

arXiv preprint arXiv:1901.04085 , year=

Passage Re-ranking with BERT , author=. arXiv preprint arXiv:1901.04085 , year=

Pith/arXiv arXiv 1901

[39] [39]

arXiv preprint arXiv:2109.10086 , year=

SPLADE v2: Sparse lexical and expansion model for information retrieval , author=. arXiv preprint arXiv:2109.10086 , year=

arXiv

[40] [40]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

[41] [41]

arXiv preprint arXiv:2404.00610 , year=

Rq-rag: Learning to refine queries for retrieval augmented generation , author=. arXiv preprint arXiv:2404.00610 , year=

arXiv

[42] [42]

Grand, Adrien and Muir, Robert and Ferenczi, Jim and Lin, Jimmy , editor =. From. Advances in. 2020 , keywords =. doi:10.1007/978-3-030-45442-5_3 , language =

work page doi:10.1007/978-3-030-45442-5_3 2020

[43] [43]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

[44] [44]

FineTuning LLaMA for Multi Stage Text Retrieval , author =

[45] [45]

2001 , organization=

Relevance-based language models , author=. 2001 , organization=

2001

[46] [46]

2004 , organization=

UMass at TREC 2004: Novelty and HARD , author=. 2004 , organization=

2004

[47] [47]

1998 , isbn =

Mitra, Mandar and Singhal, Amit and Buckley, Chris , title =. 1998 , isbn =. doi:10.1145/290941.290995 , booktitle =

work page doi:10.1145/290941.290995 1998

[48] [48]

Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong , journal =

[49] [49]

2021 , publisher =

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Lin, Jimmy , booktitle =. 2021 , publisher =

2021

[50] [50]

, booktitle =

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Voorhees, Ellen M. , booktitle =. Overview of the. 2020 , publisher =

2020

[51] [51]

Overview of the

Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel , booktitle =. Overview of the. 2021 , publisher =

2021

[52] [52]

2021 , eprint=

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. 2021 , eprint=

2021

[53] [53]

2023 , doi =

Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy , journal =. 2023 , doi =

2023

[54] [54]

MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy. MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00595

work page doi:10.1162/tacl_a_00595 2023

[55] [55]

2018 , eprint=

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , author=. 2018 , eprint=

2018

[56] [56]

A Reproducibility Study of LLM-Based Query Reformulation , author=. -. 2026 , url=

2026

[57] [57]

W ord2 P assage: Word-level Importance Re-weighting for Query Expansion

Choi, Jeonghwan and Ban, Minjeong and Kim, Minseok and Song, Hwanjun. W ord2 P assage: Word-level Importance Re-weighting for Query Expansion. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.434

work page doi:10.18653/v1/2025.findings-acl.434 2025

[58] [58]

Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study , author =

[59] [59]

Exploring the best practices of query expansion with large language models , author=