STORM: Stepwise Token Optimization with Reward-Guided Beam Search
Pith reviewed 2026-06-27 11:44 UTC · model grok-4.3
The pith
STORM trains query rewriters with token-level retrieval rewards so small models match larger ones while retaining BM25 speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is a training procedure for lexical query rewriters in which beam search is guided by BM25 retrieval scores at each step, pruning continuations that lead to poor final retrieval and thereby supplying a token-level optimization signal that improves the quality of the generated expansions.
What carries the argument
Reward-guided beam search that prunes low-reward partial query expansions using BM25 scores to generate token-level supervision signals.
If this is right
- Models from 0.6B to 8B parameters achieve retrieval performance that matches or exceeds competitive LLM rewriters.
- The expanded queries retrieve at the speed of standard BM25 without requiring new indexes.
- The trained rewriters transfer zero-shot to 18 languages and outperform dedicated multilingual dense retrievers on average.
- At the 8B scale the approach rivals performance of far larger proprietary rewriters.
Where Pith is reading between the lines
- The same pruning approach could be tested on other sequence-level metrics beyond retrieval to supply early feedback in generation tasks.
- If the token-level signal works, it reduces the infrastructure cost of switching from lexical to dense retrieval for many collections.
- The method might extend to training on multiple retrieval metrics simultaneously to balance different aspects of effectiveness.
Load-bearing premise
The method assumes that using BM25 scores to prune partial expansions during beam search provides an effective and unbiased signal for improving the model's query rewriting at the token level.
What would settle it
A direct comparison showing that models trained with STORM do not outperform prompted baselines on retrieval metrics like nDCG on TREC DL or BEIR datasets would falsify the effectiveness of the token-level signal.
Figures
read the original abstract
Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful-terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STORM, a self-supervised framework for lexical query expansion that trains LLM rewriters via reward-guided beam search. At each generation step, candidate token expansions are scored against a BM25 index and low-reward continuations are pruned, converting the retrieval reward into a token-level training signal. The method is evaluated on TREC DL and BEIR, claiming that 0.6B–8B backbones match or surpass competitive LLM rewriters while retaining BM25 retrieval speed; at 8B it rivals larger proprietary rewriters. It further reports zero-shot transfer to 18 languages on MIRACL, outperforming dedicated multilingual dense retrievers on average.
Significance. If the results hold, STORM provides an infrastructure-light alternative to dense retrieval by improving lexical methods on an unchanging inverted index. The core technical idea—turning delayed sequence-level retrieval rewards into stepwise token-level supervision via BM25-guided pruning—addresses a recognized limitation of standard RL or supervised fine-tuning for query rewriting and could influence future work on efficient, transparent retrieval.
major comments (1)
- [Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.
Authors: We agree that the abstract is concise and omits specific numerical results, baseline names, and statistical details, which are standard limitations of abstracts but can reduce immediate assessability. The full manuscript (Sections 4–5) details all experimental setups, baselines (including BM25, prompted LLMs, and dense retrievers), controls, and significance testing. To address the concern directly, we will revise the abstract to incorporate key quantitative highlights (e.g., nDCG@10 gains on TREC DL and average MIRACL performance) and explicit baseline references while preserving length constraints. This change will be reflected in the next version. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper introduces STORM as a self-supervised framework that generates token-level supervision for query rewriting by pruning beam-search expansions against an external BM25 index and retrieval metrics. This reward signal is drawn from a fixed, standard lexical retriever whose index is independent of the trained model; final claims are validated on public benchmarks (TREC DL, BEIR, MIRACL) with no reported parameter fitting that is then relabeled as a prediction, no self-definitional equations, and no load-bearing self-citations. The derivation chain therefore remains self-contained against external retrieval performance rather than reducing to its own fitted inputs or prior author results.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A Reproducibility Study of LLM-Based Query Reformulation , doi =
Bigdeli, Amin and Hamidi Rad, Radin and Le, Hai and Incesu, Mert and Arabzadeh, Negar and Clarke, Charles and Bagheri, Ebrahim , year =. A Reproducibility Study of LLM-Based Query Reformulation , doi =
-
[2]
Aho and Jeffrey D
Alfred V. Aho and Jeffrey D. Ullman , title =. 1972
1972
-
[3]
2019 , eprint=
Document Expansion by Query Prediction , author=. 2019 , eprint=
2019
-
[4]
Publications Manual , year = "1983", publisher =
1983
-
[5]
2025 , eprint=
Qwen3 Technical Report , author=. 2025 , eprint=
2025
-
[6]
2026 , eprint=
QueStER: Query Specification for Generative keyword-based Retrieval , author=. 2026 , eprint=
2026
-
[7]
2022 , eprint=
Generative Cooperative Networks for Natural Language Generation , author=. 2022 , eprint=
2022
-
[8]
Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval
Satouf, Arthur and Zong, Yuxuan and Boubacar, Habiboulaye Amadou and Piantanida, Pablo and Piwowarski, Benjamin. Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.312
-
[9]
2023 , eprint=
Generative Query Reformulation for Effective Adhoc Search , author=. 2023 , eprint=
2023
-
[10]
2023 , eprint=
Query Expansion by Prompting Large Language Models , author=. 2023 , eprint=
2023
-
[11]
Lavrenko, Victor and Croft, W. Bruce , title =. 2001 , isbn =. doi:10.1145/383952.383972 , booktitle =
-
[12]
Azad, Hiteshwar Kumar and Deepak, Akshay , title =. 2019 , issue_date =. doi:10.1016/j.ipm.2019.05.009 , journal =
-
[13]
Fontoura, Marcus and Josifovski, Vanja and Liu, Jinhui and Venkatesan, Srihari and Zhu, Xiangfei and Zien, Jason , title =. 2011 , issue_date =. doi:10.14778/3402755.3402756 , journal =
-
[14]
2020 , eprint=
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT , author=. 2020 , eprint=
2020
-
[15]
Dense Passage Retrieval for Open-Domain Question Answering
Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550
-
[16]
Zhang, Xinyu and Ma, Xueguang and Shi, Peng and Lin, Jimmy. Mr. T y D i: A Multi-lingual Benchmark for Dense Retrieval. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.12
-
[17]
Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever
Shen, Tao and Long, Guodong and Geng, Xiubo and Tao, Chongyang and Lei, Yibin and Zhou, Tianyi and Blumenstein, Michael and Jiang, Daxin. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.943
-
[18]
2022 , eprint=
Unsupervised Dense Information Retrieval with Contrastive Learning , author=. 2022 , eprint=
2022
-
[19]
2022 , eprint=
mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset , author=. 2022 , eprint=
2022
-
[20]
Corpus-Steered Query Expansion with Large Language Models
Lei, Yibin and Cao, Yu and Zhou, Tianyi and Shen, Tao and Yates, Andrew. Corpus-Steered Query Expansion with Large Language Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.eacl-short.34
-
[21]
2023 , eprint=
Query2doc: Query Expansion with Large Language Models , author=. 2023 , eprint=
2023
-
[22]
Dhole, Kaustubh D. and Agichtein, Eugene , year=. GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation , ISBN=. doi:10.1007/978-3-031-56063-7_24 , booktitle=
-
[23]
2024 , eprint=
Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers , author=. 2024 , eprint=
2024
-
[24]
2021 , eprint=
Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations , author=. 2021 , eprint=
2021
-
[25]
Text Retrieval Conference , year=
Okapi at TREC-3 , author=. Text Retrieval Conference , year=
-
[26]
Pretrained Transformers for Text Ranking: BERT and Beyond
Yates, Andrew and Nogueira, Rodrigo and Lin, Jimmy. Pretrained Transformers for Text Ranking: BERT and Beyond. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials. 2021. doi:10.18653/v1/2021.naacl-tutorials.1
-
[27]
Vakulenko, Svitlana and Longpre, Shayne and Tu, Zhucheng and Anantha, Raviteja. A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering. Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI). 2020. doi:10.18653/v1/2020.scai-1.2
-
[28]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
-
[29]
Scalable training of
Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
-
[30]
Dan Gusfield , title =. 1997
1997
-
[31]
Tetreault , title =
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
2015
-
[32]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[33]
Precise Zero‑Shot Dense Retrieval without Relevance Labels , author =
-
[34]
Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever , author =
-
[35]
Query Rewriting for Retrieval-Augmented Large Language Models , author =. emnlp
-
[36]
RaFe: Ranking Feedback Improves Query Rewriting for RAG , author =
-
[37]
2020 , organization=
Dense Passage Retrieval for Open-Domain Question Answering , author=. 2020 , organization=
2020
-
[38]
arXiv preprint arXiv:1901.04085 , year=
Passage Re-ranking with BERT , author=. arXiv preprint arXiv:1901.04085 , year=
Pith/arXiv arXiv 1901
-
[39]
arXiv preprint arXiv:2109.10086 , year=
SPLADE v2: Sparse lexical and expansion model for information retrieval , author=. arXiv preprint arXiv:2109.10086 , year=
-
[40]
Proceedings of the AAAI Conference on Artificial Intelligence , year=
Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=
-
[41]
arXiv preprint arXiv:2404.00610 , year=
Rq-rag: Learning to refine queries for retrieval augmented generation , author=. arXiv preprint arXiv:2404.00610 , year=
-
[42]
Grand, Adrien and Muir, Robert and Ferenczi, Jim and Lin, Jimmy , editor =. From. Advances in. 2020 , keywords =. doi:10.1007/978-3-030-45442-5_3 , language =
-
[43]
Advances in neural information processing systems , volume=
Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=
-
[44]
FineTuning LLaMA for Multi Stage Text Retrieval , author =
-
[45]
2001 , organization=
Relevance-based language models , author=. 2001 , organization=
2001
-
[46]
2004 , organization=
UMass at TREC 2004: Novelty and HARD , author=. 2004 , organization=
2004
-
[47]
Mitra, Mandar and Singhal, Amit and Buckley, Chris , title =. 1998 , isbn =. doi:10.1145/290941.290995 , booktitle =
-
[48]
Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong , journal =
-
[49]
2021 , publisher =
Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Lin, Jimmy , booktitle =. 2021 , publisher =
2021
-
[50]
, booktitle =
Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Voorhees, Ellen M. , booktitle =. Overview of the. 2020 , publisher =
2020
-
[51]
Overview of the
Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel , booktitle =. Overview of the. 2021 , publisher =
2021
-
[52]
2021 , eprint=
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. 2021 , eprint=
2021
-
[53]
2023 , doi =
Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy , journal =. 2023 , doi =
2023
-
[54]
MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages
Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy. MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00595
-
[55]
2018 , eprint=
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , author=. 2018 , eprint=
2018
-
[56]
A Reproducibility Study of LLM-Based Query Reformulation , author=. -. 2026 , url=
2026
-
[57]
W ord2 P assage: Word-level Importance Re-weighting for Query Expansion
Choi, Jeonghwan and Ban, Minjeong and Kim, Minseok and Song, Hwanjun. W ord2 P assage: Word-level Importance Re-weighting for Query Expansion. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.434
-
[58]
Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study , author =
-
[59]
Exploring the best practices of query expansion with large language models , author=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.