SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Pith reviewed 2026-05-20 09:55 UTC · model grok-4.3
The pith
A single model can create its own step-level supervision for search queries by distilling from a hindsight-aware version of itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that on-policy hindsight self-distillation supplies effective step-level supervision for query decisions. A single model acts as the teacher when it is additionally conditioned on a compact hindsight block that summarizes the search queries and final outcomes of rollouts sampled for the same question. The student is trained to recover the teacher's query distribution by minimizing the token-level Jensen-Shannon divergence at search-query positions, which layers a dense signal on top of trajectory-level rewards such as those used by GRPO.
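A minimal sketch of the distribution-matching term, assuming the student and teacher next-token distributions are already available as plain probability lists and that a boolean mask marks the search-query positions. The function names, the mask format, and the averaging over query positions are illustrative assumptions, not details confirmed by the paper.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    # Mixture distribution m = (p + q) / 2.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def query_jsd_loss(student_probs, teacher_probs, query_mask):
    """Average the per-token JSD over positions flagged as search-query tokens.

    Assumption: non-query positions are simply skipped.
    """
    terms = [
        js_divergence(s, t)
        for s, t, is_query in zip(student_probs, teacher_probs, query_mask)
        if is_query
    ]
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical two-token example over a three-word vocabulary.
student = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
teacher = [[0.2, 0.7, 0.1], [0.5, 0.5, 0.0]]
mask = [True, False]  # only the first position belongs to a search query
loss = query_jsd_loss(student, teacher, mask)
```

Because the divergence is symmetric and bounded by ln 2, the term stays finite even where the two distributions place mass on different tokens.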
What carries the argument
The hindsight block: a compact summary of the search queries and final outcomes of sampled rollouts. It lets the teacher implicitly mark useful decisions, which the student then learns through distribution matching.
If this is right
- The approach adds dense step-level signals to the coarse trajectory reward used in reinforcement learning for search-augmented agents.
- Training stays inside the standard RL loop and requires no separate annotation pipeline or external model calls.
- Query quality improves because the student learns to imitate the decisions that the hindsight-aware teacher associates with successful outcomes.
- The method removes dependence on larger teacher models or externally generated sub-question labels.
Where Pith is reading between the lines
- The same self-distillation pattern could be tested on other sequential decision problems where credit assignment is difficult and hindsight summaries are easy to form.
- Varying the size or content of the hindsight block might reveal how much outcome information is needed to produce useful query distributions.
- If the divergence signal proves reliable, it could be combined with other forms of process supervision without increasing model count.
Load-bearing premise
The hindsight block must contain enough information about which rollouts succeeded that matching the teacher's query distribution actually improves the student's later decisions when the block is removed.
What would settle it
An ablation in which removing the hindsight block or replacing the Jensen-Shannon term with uniform random targets produces no gain in final task accuracy or query quality compared with standard trajectory-only reinforcement learning.
Original abstract
Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
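The credit-assignment gap described in the abstract can be illustrated with a small sketch of group-relative advantages in the style of GRPO. The normalization shown here (population standard deviation plus a small epsilon) and both function names are assumptions for illustration; implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's outcome reward against its group.

    Assumption: population std with a small epsilon for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_queries(advantages, queries_per_rollout):
    """Every search query in a rollout inherits that rollout's single advantage."""
    return [[a] * n for a, n in zip(advantages, queries_per_rollout)]

# Hypothetical group of four rollouts for one question (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
per_query = broadcast_to_queries(advantages, [3, 1, 2, 2])
```

In this sketch the three queries of the first rollout receive an identical value, so a helpful query and a wasted one are indistinguishable to the trajectory-level signal.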
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SD-Search, a training method for search-augmented reasoning agents, which interleave internal reasoning with external retrieval. It addresses the lack of step-level credit in outcome-reward RL (e.g., GRPO) by introducing on-policy hindsight self-distillation: a single model serves as a student (inference-time context only) and as a teacher (additionally conditioned on a compact hindsight block of sampled rollouts' queries and outcomes). The student is trained to match the teacher's query distribution at search positions via token-level Jensen-Shannon divergence, layering dense supervision on top of trajectory rewards without external teachers, annotations, or extra training stages.
Significance. If the mechanism is effective, the work provides a self-contained way to derive process-level signals for query generation in RL-trained agents, reducing dependence on larger teachers or manual sub-question labels. The on-policy nature, use of a single model in dual roles, and token-level JSD objective are strengths that could improve sample efficiency and performance on multi-step reasoning tasks with retrieval.
major comments (1)
- [Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.
minor comments (1)
- [Abstract] Notation for the Jensen-Shannon divergence should be defined explicitly (including the token-level formulation) rather than left as 'token-level Jensen-Shannon divergence'.
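For concreteness, one standard way to write such a definition is sketched here. The symbols are assumptions introduced for illustration and are not the paper's notation: $c_t$ is the inference-time context at position $t$, $h$ is the hindsight block, $\pi_\theta(\cdot \mid c_t)$ is the student distribution, $\pi_\theta(\cdot \mid c_t, h)$ is the teacher distribution, and $Q$ is the set of search-query token positions.

```latex
m_t = \tfrac{1}{2}\,\pi_\theta(\cdot \mid c_t) + \tfrac{1}{2}\,\pi_\theta(\cdot \mid c_t, h)

\mathrm{JSD}_t = \tfrac{1}{2}\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid c_t)\,\|\,m_t\right)
               + \tfrac{1}{2}\,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid c_t, h)\,\|\,m_t\right)

\mathcal{L}_{\mathrm{SD}} = \frac{1}{|Q|} \sum_{t \in Q} \mathrm{JSD}_t
```

This uses the symmetric, equal-weight form of the divergence. Whether the paper uses a weighted variant, or stops gradients through the teacher branch, is not stated in the text available here.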
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our manuscript. We have carefully considered the major comment regarding the description of the hindsight block and provide our response below. We believe the clarification will strengthen the paper.
Referee: [Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.
Authors: We thank the referee for highlighting this important point about the clarity of our description. The hindsight block is constructed by appending, for each sampled rollout, the sequence of search queries issued during that rollout followed by the final trajectory outcome (e.g., success or failure indicator). This structure explicitly links individual queries to their outcomes within the block, enabling the teacher model to distinguish the contribution of specific queries to successful rollouts. The student, lacking this block, learns to produce query distributions that align with those informed by outcome knowledge. We acknowledge that the abstract's phrasing is concise and could be more explicit about this per-query attribution. We will revise the abstract and the method section to include a more detailed description of the hindsight block's format, such as 'a compact hindsight block that concatenates per-rollout query sequences with their corresponding final outcomes'. This should address the concern and make the step-level supervision mechanism clearer.
Revision: yes
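The block format described in this (simulated) response can be sketched as follows. Only the overall structure, each rollout's queries followed by its final outcome, comes from the text; the `Rollout` type, the field names, and the exact text template are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    # Hypothetical container: the queries issued in one rollout and its outcome.
    queries: list[str]
    succeeded: bool

def build_hindsight_block(rollouts):
    """Concatenate per-rollout query sequences with their final outcomes.

    Assumption: a plain-text template; the real format is not specified.
    """
    lines = []
    for i, rollout in enumerate(rollouts, start=1):
        lines.append(f"Rollout {i}:")
        for j, query in enumerate(rollout.queries, start=1):
            lines.append(f"  query {j}: {query}")
        outcome = "success" if rollout.succeeded else "failure"
        lines.append(f"  outcome: {outcome}")
    return "\n".join(lines)

# Hypothetical rollouts sampled for the same question.
block = build_hindsight_block([
    Rollout(queries=["capital of France"], succeeded=True),
    Rollout(queries=["France", "largest French city"], succeeded=False),
])
```

Note that this layout attaches one outcome to a whole query sequence, so any per-query credit has to be inferred by the teacher from contrasts across rollouts, which is the point the referee raises.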
Circularity Check
No significant circularity; self-contained on-policy training procedure
The paper's central claim is that SD-Search generates step-level supervision internally by having the same policy act as teacher (conditioned on a hindsight block of sampled rollouts and outcomes) and student (inference-time context only), with the student trained to match the teacher's query distribution via token-level JSD on top of GRPO trajectory rewards. This construction does not reduce to a tautology or fitted input by definition: the hindsight summary supplies outcome information unavailable to the student at inference time, and the JSD objective is an explicit additional loss rather than a renaming or self-referential fit. No equations or sections in the provided text exhibit self-definitional equivalence, load-bearing self-citations, or imported uniqueness theorems. The method is presented as an empirical augmentation within the standard RL loop, with effectiveness depending on whether the compact block preserves per-query signals—an assumption that is falsifiable rather than definitional.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Outcome-reward reinforcement learning assigns the same trajectory-level reward to every search decision within a rollout.
invented entities (1)
- hindsight block: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "on-policy hindsight self-distillation... minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "compact hindsight block summarizing the search queries and final outcomes"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.