pith. sign in

arxiv: 2606.10621 · v1 · pith:QY5J6ZEFnew · submitted 2026-06-09 · 💻 cs.IR · cs.AI

STORM: Stepwise Token Optimization with Reward-Guided Beam Search

Pith reviewed 2026-06-27 11:44 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords query expansionlexical retrievalbeam searchreward-guided generationself-supervised learninginformation retrievalquery rewritingBM25
0
0 comments X

The pith

STORM trains query rewriters with token-level retrieval rewards so small models match larger ones while retaining BM25 speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that guiding query expansion generation with stepwise retrieval rewards allows smaller language models to produce effective lexical expansions. By pruning low-scoring partial sequences against an existing BM25 index, the method converts a delayed retrieval signal into per-token feedback. This matters because it enables competitive performance on retrieval tasks without the computational overhead of dense indexes or the need for very large models. Readers would care if it means maintaining the efficiency and transparency of traditional search while improving its effectiveness through learned expansions.

Core claim

The central discovery is a training procedure for lexical query rewriters in which beam search is guided by BM25 retrieval scores at each step, pruning continuations that lead to poor final retrieval and thereby supplying a token-level optimization signal that improves the quality of the generated expansions.

What carries the argument

Reward-guided beam search that prunes low-reward partial query expansions using BM25 scores to generate token-level supervision signals.

If this is right

  • Models from 0.6B to 8B parameters achieve retrieval performance that matches or exceeds competitive LLM rewriters.
  • The expanded queries retrieve at the speed of standard BM25 without requiring new indexes.
  • The trained rewriters transfer zero-shot to 18 languages and outperform dedicated multilingual dense retrievers on average.
  • At the 8B scale the approach rivals performance of far larger proprietary rewriters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pruning approach could be tested on other sequence-level metrics beyond retrieval to supply early feedback in generation tasks.
  • If the token-level signal works, it reduces the infrastructure cost of switching from lexical to dense retrieval for many collections.
  • The method might extend to training on multiple retrieval metrics simultaneously to balance different aspects of effectiveness.

Load-bearing premise

The method assumes that using BM25 scores to prune partial expansions during beam search provides an effective and unbiased signal for improving the model's query rewriting at the token level.

What would settle it

A direct comparison showing that models trained with STORM do not outperform prompted baselines on retrieval metrics like nDCG on TREC DL or BEIR datasets would falsify the effectiveness of the token-level signal.

Figures

Figures reproduced from arXiv: 2606.10621 by Arthur Satouf, Benjamin Piwowarski, Giulio D'Erasmo, Habiboulaye Amadou Boubacar, Pablo Piantanida, Yuxuan Zong.

Figure 1
Figure 1. Figure 1: Overview of the STORM training loop. A Bernoulli(ε) coin flip routes each query to either reward￾guided beam search, where partial expansions are scored against the inverted index and low-reward branches are pruned, or nucleus sampling (classic generation). The selected sequence is then scored and importance￾weighted to update the policy θ. 3.1 Task Formulation Let VLLM denote the LLM token vocabulary and … view at source ↗
Figure 2
Figure 2. Figure 2: Latency–effectiveness comparison on TREC DL’20. The x-axis shows total per-query latency (generation [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
read the original abstract

Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful-terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces STORM, a self-supervised framework for lexical query expansion that trains LLM rewriters via reward-guided beam search. At each generation step, candidate token expansions are scored against a BM25 index and low-reward continuations are pruned, converting the retrieval reward into a token-level training signal. The method is evaluated on TREC DL and BEIR, claiming that 0.6B–8B backbones match or surpass competitive LLM rewriters while retaining BM25 retrieval speed; at 8B it rivals larger proprietary rewriters. It further reports zero-shot transfer to 18 languages on MIRACL, outperforming dedicated multilingual dense retrievers on average.

Significance. If the results hold, STORM provides an infrastructure-light alternative to dense retrieval by improving lexical methods on an unchanging inverted index. The core technical idea—turning delayed sequence-level retrieval rewards into stepwise token-level supervision via BM25-guided pruning—addresses a recognized limitation of standard RL or supervised fine-tuning for query rewriting and could influence future work on efficient, transparent retrieval.

major comments (1)
  1. [Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the abstract states positive benchmark results on TREC DL, BEIR, and MIRACL but supplies no experimental details, baselines, controls, or statistical information, so it is impossible to determine whether the reported numbers support the claims.

    Authors: We agree that the abstract is concise and omits specific numerical results, baseline names, and statistical details, which are standard limitations of abstracts but can reduce immediate assessability. The full manuscript (Sections 4–5) details all experimental setups, baselines (including BM25, prompted LLMs, and dense retrievers), controls, and significance testing. To address the concern directly, we will revise the abstract to incorporate key quantitative highlights (e.g., nDCG@10 gains on TREC DL and average MIRACL performance) and explicit baseline references while preserving length constraints. This change will be reflected in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces STORM as a self-supervised framework that generates token-level supervision for query rewriting by pruning beam-search expansions against an external BM25 index and retrieval metrics. This reward signal is drawn from a fixed, standard lexical retriever whose index is independent of the trained model; final claims are validated on public benchmarks (TREC DL, BEIR, MIRACL) with no reported parameter fitting that is then relabeled as a prediction, no self-definitional equations, and no load-bearing self-citations. The derivation chain therefore remains self-contained against external retrieval performance rather than reducing to its own fitted inputs or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information is given on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5796 in / 1312 out tokens · 37455 ms · 2026-06-27T11:44:36.786544+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 16 canonical work pages

  1. [1]

    A Reproducibility Study of LLM-Based Query Reformulation , doi =

    Bigdeli, Amin and Hamidi Rad, Radin and Le, Hai and Incesu, Mert and Arabzadeh, Negar and Clarke, Charles and Bagheri, Ebrahim , year =. A Reproducibility Study of LLM-Based Query Reformulation , doi =

  2. [2]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  3. [3]

    2019 , eprint=

    Document Expansion by Query Prediction , author=. 2019 , eprint=

  4. [4]

    Publications Manual , year = "1983", publisher =

  5. [5]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  6. [6]

    2026 , eprint=

    QueStER: Query Specification for Generative keyword-based Retrieval , author=. 2026 , eprint=

  7. [7]

    2022 , eprint=

    Generative Cooperative Networks for Natural Language Generation , author=. 2022 , eprint=

  8. [8]

    Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval

    Satouf, Arthur and Zong, Yuxuan and Boubacar, Habiboulaye Amadou and Piantanida, Pablo and Piwowarski, Benjamin. Q ue S t ER : Query Specification for Generative Keyword-Based Retrieval. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.312

  9. [9]

    2023 , eprint=

    Generative Query Reformulation for Effective Adhoc Search , author=. 2023 , eprint=

  10. [10]

    2023 , eprint=

    Query Expansion by Prompting Large Language Models , author=. 2023 , eprint=

  11. [11]

    Bruce , title =

    Lavrenko, Victor and Croft, W. Bruce , title =. 2001 , isbn =. doi:10.1145/383952.383972 , booktitle =

  12. [12]

    2019 , issue_date =

    Azad, Hiteshwar Kumar and Deepak, Akshay , title =. 2019 , issue_date =. doi:10.1016/j.ipm.2019.05.009 , journal =

  13. [13]

    2011 , issue_date =

    Fontoura, Marcus and Josifovski, Vanja and Liu, Jinhui and Venkatesan, Srihari and Zhu, Xiangfei and Zien, Jason , title =. 2011 , issue_date =. doi:10.14778/3402755.3402756 , journal =

  14. [14]

    2020 , eprint=

    ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT , author=. 2020 , eprint=

  15. [15]

    Dense Passage Retrieval for Open-Domain Question Answering

    Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

  16. [16]

    Zhang, Xinyu and Ma, Xueguang and Shi, Peng and Lin, Jimmy. Mr. T y D i: A Multi-lingual Benchmark for Dense Retrieval. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.12

  17. [17]

    Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever

    Shen, Tao and Long, Guodong and Geng, Xiubo and Tao, Chongyang and Lei, Yibin and Zhou, Tianyi and Blumenstein, Michael and Jiang, Daxin. Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.943

  18. [18]

    2022 , eprint=

    Unsupervised Dense Information Retrieval with Contrastive Learning , author=. 2022 , eprint=

  19. [19]

    2022 , eprint=

    mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset , author=. 2022 , eprint=

  20. [20]

    Corpus-Steered Query Expansion with Large Language Models

    Lei, Yibin and Cao, Yu and Zhou, Tianyi and Shen, Tao and Yates, Andrew. Corpus-Steered Query Expansion with Large Language Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.eacl-short.34

  21. [21]

    2023 , eprint=

    Query2doc: Query Expansion with Large Language Models , author=. 2023 , eprint=

  22. [22]

    and Agichtein, Eugene , year=

    Dhole, Kaustubh D. and Agichtein, Eugene , year=. GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation , ISBN=. doi:10.1007/978-3-031-56063-7_24 , booktitle=

  23. [23]

    2024 , eprint=

    Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers , author=. 2024 , eprint=

  24. [24]

    2021 , eprint=

    Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations , author=. 2021 , eprint=

  25. [25]

    Text Retrieval Conference , year=

    Okapi at TREC-3 , author=. Text Retrieval Conference , year=

  26. [26]

    Pretrained Transformers for Text Ranking: BERT and Beyond

    Yates, Andrew and Nogueira, Rodrigo and Lin, Jimmy. Pretrained Transformers for Text Ranking: BERT and Beyond. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials. 2021. doi:10.18653/v1/2021.naacl-tutorials.1

  27. [27]

    A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering

    Vakulenko, Svitlana and Longpre, Shayne and Tu, Zhucheng and Anantha, Raviteja. A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering. Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI). 2020. doi:10.18653/v1/2020.scai-1.2

  28. [28]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  29. [29]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  30. [30]

    Dan Gusfield , title =. 1997

  31. [31]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  32. [32]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  33. [33]

    Precise Zero‑Shot Dense Retrieval without Relevance Labels , author =

  34. [34]

    Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever , author =

  35. [35]

    Query Rewriting for Retrieval-Augmented Large Language Models , author =. emnlp

  36. [36]

    RaFe: Ranking Feedback Improves Query Rewriting for RAG , author =

  37. [37]

    2020 , organization=

    Dense Passage Retrieval for Open-Domain Question Answering , author=. 2020 , organization=

  38. [38]

    arXiv preprint arXiv:1901.04085 , year=

    Passage Re-ranking with BERT , author=. arXiv preprint arXiv:1901.04085 , year=

  39. [39]

    arXiv preprint arXiv:2109.10086 , year=

    SPLADE v2: Sparse lexical and expansion model for information retrieval , author=. arXiv preprint arXiv:2109.10086 , year=

  40. [40]

    Proceedings of the AAAI Conference on Artificial Intelligence , year=

    Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

  41. [41]

    arXiv preprint arXiv:2404.00610 , year=

    Rq-rag: Learning to refine queries for retrieval augmented generation , author=. arXiv preprint arXiv:2404.00610 , year=

  42. [42]

    Grand, Adrien and Muir, Robert and Ferenczi, Jim and Lin, Jimmy , editor =. From. Advances in. 2020 , keywords =. doi:10.1007/978-3-030-45442-5_3 , language =

  43. [43]

    Advances in neural information processing systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

  44. [44]

    FineTuning LLaMA for Multi Stage Text Retrieval , author =

  45. [45]

    2001 , organization=

    Relevance-based language models , author=. 2001 , organization=

  46. [46]

    2004 , organization=

    UMass at TREC 2004: Novelty and HARD , author=. 2004 , organization=

  47. [47]

    1998 , isbn =

    Mitra, Mandar and Singhal, Amit and Buckley, Chris , title =. 1998 , isbn =. doi:10.1145/290941.290995 , booktitle =

  48. [48]

    Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong , journal =

  49. [49]

    2021 , publisher =

    Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Lin, Jimmy , booktitle =. 2021 , publisher =

  50. [50]

    , booktitle =

    Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel and Voorhees, Ellen M. , booktitle =. Overview of the. 2020 , publisher =

  51. [51]

    Overview of the

    Craswell, Nick and Mitra, Bhaskar and Yilmaz, Emine and Campos, Daniel , booktitle =. Overview of the. 2021 , publisher =

  52. [52]

    2021 , eprint=

    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. 2021 , eprint=

  53. [53]

    2023 , doi =

    Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy , journal =. 2023 , doi =

  54. [54]

    MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages

    Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy. MIRACL : A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00595

  55. [55]

    2018 , eprint=

    Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , author=. 2018 , eprint=

  56. [56]

    A Reproducibility Study of LLM-Based Query Reformulation , author=. -. 2026 , url=

  57. [57]

    W ord2 P assage: Word-level Importance Re-weighting for Query Expansion

    Choi, Jeonghwan and Ban, Minjeong and Kim, Minseok and Song, Hwanjun. W ord2 P assage: Word-level Importance Re-weighting for Query Expansion. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.434

  58. [58]

    Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study , author =

  59. [59]

    Exploring the best practices of query expansion with large language models , author=