arxiv: 2605.18299 · v1 · pith:4HLLBK5Znew · submitted 2026-05-18 · 💻 cs.AI · cs.CL · cs.IR

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Pith reviewed 2026-05-20 09:55 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR
keywords search-augmented reasoning · self-distillation · hindsight learning · on-policy training · Jensen-Shannon divergence · reinforcement learning · query generation

The pith

A single model can create its own step-level supervision for search queries by distilling from a hindsight-aware version of itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Search-augmented reasoning agents suffer from poor credit assignment because every search query in a rollout receives only the same final outcome reward. SD-Search lets the model itself generate denser signals by running rollouts, packing their queries and results into a compact hindsight block, and using that block to condition a teacher version of the model. The inference-time student version is then trained to match the teacher's query distribution at search positions through token-level Jensen-Shannon divergence. This process runs inside the normal reinforcement learning loop and needs no larger external model or extra human annotations.
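The credit-assignment problem described above can be made concrete with a small sketch. It assumes a GRPO-style advantage that normalizes each rollout's outcome reward against its group mean and standard deviation; the reward values, query counts, and the `eps` constant are illustrative assumptions, not figures from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's outcome reward against its group (GRPO-style).

    Assumption: population standard deviation and a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one question (1.0 = correct final answer).
rewards = [1.0, 0.0, 0.0, 1.0]
queries_per_rollout = [2, 3, 1, 2]  # search queries issued by each rollout

advantages = group_relative_advantages(rewards)

# Every query inside a rollout inherits that rollout's single advantage,
# so a good query and a redundant one in the same rollout get equal credit.
per_query_credit = [[adv] * n for adv, n in zip(advantages, queries_per_rollout)]
```

In this toy group, the three queries of the second rollout all receive the same negative credit, whether or not any one of them was a sensible search. That is the gap a step-level signal is meant to fill.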

Core claim

The central claim is that on-policy hindsight self-distillation supplies effective step-level supervision for query decisions. A single model acts as the teacher when it is additionally conditioned on a compact hindsight block that summarizes the search queries and final outcomes of rollouts sampled for the same question. The student is trained to recover the teacher's query distribution by minimizing token-level Jensen-Shannon divergence at search-query positions, which layers a dense signal on top of trajectory-level rewards such as those from GRPO.
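As a rough illustration of the distribution-matching term, the sketch below computes a per-token Jensen-Shannon divergence and averages it over positions flagged as search-query tokens. The equal-weight mixture, the averaging, the mask name `is_query_token`, and the toy distributions are assumptions; the abstract does not give the paper's exact formulation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence with the equal-weight mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def query_jsd_loss(student_dists, teacher_dists, is_query_token):
    """Average per-token JSD over positions flagged as search-query tokens."""
    terms = [
        jsd(s, t)
        for s, t, flag in zip(student_dists, teacher_dists, is_query_token)
        if flag
    ]
    return sum(terms) / len(terms) if terms else 0.0

# Toy 3-token vocabulary, 3 positions; only the last two are query tokens.
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
teacher = [[0.1, 0.1, 0.8], [0.5, 0.3, 0.2], [0.8, 0.1, 0.1]]
mask = [False, True, True]

loss = query_jsd_loss(student, teacher, mask)
```

The first position is ignored even though the two distributions disagree there, because it is not a query token. Among the masked positions, only the third contributes a nonzero term.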

What carries the argument

The hindsight block: a compact summary of search queries and final outcomes from sampled rollouts. It lets the teacher implicitly mark useful decisions, which the student then learns through distribution matching.

If this is right

  • The approach adds dense step-level signals to the coarse trajectory reward used in reinforcement learning for search-augmented agents.
  • Training stays inside the standard RL loop and requires no separate annotation pipeline or external model calls.
  • Query quality improves because the student learns to imitate the decisions that the hindsight-aware teacher associates with successful outcomes.
  • The method removes dependence on larger teacher models or externally generated sub-question labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-distillation pattern could be tested on other sequential decision problems where credit assignment is difficult and hindsight summaries are easy to form.
  • Varying the size or content of the hindsight block might reveal how much outcome information is needed to produce useful query distributions.
  • If the divergence signal proves reliable, it could be combined with other forms of process supervision without increasing model count.

Load-bearing premise

The hindsight block must contain enough information about which rollouts succeeded that matching the teacher's query distribution actually improves the student's later decisions when the block is removed.

What would settle it

An ablation in which removing the hindsight block or replacing the Jensen-Shannon term with uniform random targets produces no gain in final task accuracy or query quality compared with standard trajectory-only reinforcement learning.

Figures

Figures reproduced from arXiv: 2605.18299 by Ben Chen, Chenyi Lei, Huangyu Dai, Lingtao Mao, Wenwu Ou, Xuxin Zhang, Yufei Ma, Zhipeng Qian, Zihan Liang.

Figure 1: Three paradigms for supervising search decisions.
Figure 2: Method overview of SD-Search. The policy acts as its own teacher by conditioning on a…
Figure 3: Hindsight block construction ablation on Qwen2.5-3B-Base. Eight configurations cluster…
Figure 4: Sensitivity of hyperparameters on Qwen2.5-3B-Base. Each panel varies one hyperparameter…
Figure 5: Training dynamics of SD-Search versus AutoRefine on Qwen2.5-3B-Base, averaged…
Figure 6: Per-token trace of P(actual token | condition) across the action span of a failed Bamboogle rollout, in which the focal trajectory issues two near-identical queries, a degenerate failure mode that hindsight conditioning is designed to penalize. Student (blue) sees no hindsight; teacher with real outcomes (green) is the SD-Search teacher view; teacher with flipped outcomes (purple) is a control in which eve…
Figure 7: Scaling on Qwen2.5 at 1.5B, 3B, 7B, 14B (average EM across the seven benchmarks at…
Figure 8: Teacher input layout for one supervised step. Blocks (1) and (5) are the standard student…
Original abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
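The "two roles that differ only in conditioning" idea can be sketched as one callable evaluated on two prompts. Everything named here is hypothetical: the `[hindsight]` tag, the prompt layout, the placement of the block before the student context, and `toy_policy`, which merely stands in for a shared language model.

```python
HINDSIGHT_TAG = "[hindsight]"  # hypothetical marker, not taken from the paper

def student_view(question, history):
    """Inference-time context: the question plus the trajectory so far."""
    return f"Question: {question}\n{history}"

def teacher_view(question, history, hindsight_block):
    """Same context, with the hindsight block added as privileged input.

    Assumption: the block is simply prepended; the paper's layout may differ.
    """
    return f"{HINDSIGHT_TAG}\n{hindsight_block}\n{student_view(question, history)}"

def toy_policy(prompt):
    """Stand-in for one shared model: a distribution over two candidate query
    tokens that shifts only when hindsight is present in the prompt."""
    return [0.8, 0.2] if HINDSIGHT_TAG in prompt else [0.5, 0.5]

question = "example multi-hop question"
history = "reasoning so far ... next search query:"
block = "Rollout 1: query A -> failure\nRollout 2: query B -> success"

# One set of weights (here, one function), two conditioning contexts.
student_dist = toy_policy(student_view(question, history))
teacher_dist = toy_policy(teacher_view(question, history, block))
```

The point of the sketch is structural: no second model appears anywhere, and the only difference between the two distributions comes from the extra text in the teacher's prompt.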

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SD-Search, a training method for search-augmented reasoning agents, which interleave internal reasoning with external retrieval. It addresses the lack of step-level credit in outcome-reward RL (e.g., GRPO) by introducing on-policy hindsight self-distillation: a single model serves as a student (inference-time context only) and a teacher (additionally conditioned on a compact hindsight block of sampled rollouts' queries and outcomes). The student is trained to match the teacher's query distribution at search positions via token-level Jensen-Shannon divergence, layering dense supervision on top of trajectory rewards without external teachers, annotations, or extra training stages.

Significance. If the mechanism is effective, the work provides a self-contained way to derive process-level signals for query generation in RL-trained agents, reducing dependence on larger teachers or manual sub-question labels. The on-policy nature, use of a single model in dual roles, and token-level JSD objective are strengths that could improve sample efficiency and performance on multi-step reasoning tasks with retrieval.

major comments (1)
  1. [Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.
minor comments (1)
  1. [Abstract] Notation for the Jensen-Shannon divergence should be defined explicitly (including the token-level formulation) rather than left as 'token-level Jensen-Shannon divergence'.
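
For reference, the standard definition of the Jensen-Shannon divergence, together with one plausible token-level form, is given below. The second line is an assumption rather than the paper's stated objective: the symbols $\mathcal{Q}$ (the set of search-query token positions), $c_t$ (the inference-time context at position $t$), and $h$ (the hindsight block) are introduced here for illustration, and the paper may use a different mixture weight or normalization.

```latex
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)

\mathcal{L}_{\mathrm{JSD}} = \frac{1}{|\mathcal{Q}|} \sum_{t \in \mathcal{Q}}
\mathrm{JSD}\big( \pi_\theta(\cdot \mid c_t) \,\|\, \pi_\theta(\cdot \mid h, c_t) \big)
```

Here $\pi_\theta(\cdot \mid c_t)$ is the student's next-token distribution and $\pi_\theta(\cdot \mid h, c_t)$ is the teacher's, both produced by the same parameters $\theta$.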

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We have carefully considered the major comment regarding the description of the hindsight block and provide our response below. We believe the clarification will strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.

    Authors: We thank the referee for highlighting this important point about the clarity of our description. The hindsight block is constructed by appending, for each sampled rollout, the sequence of search queries issued during that rollout followed by the final trajectory outcome (e.g., success or failure indicator). This structure explicitly links individual queries to their outcomes within the block, enabling the teacher model to distinguish the contribution of specific queries to successful rollouts. The student, lacking this block, learns to produce query distributions that align with those informed by outcome knowledge. We acknowledge that the abstract's phrasing is concise and could be more explicit about this per-query attribution. We will revise the abstract and the method section to include a more detailed description of the hindsight block's format, such as 'a compact hindsight block that concatenates per-rollout query sequences with their corresponding final outcomes'. This should address the concern and make the step-level supervision mechanism clearer. revision: yes
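
The block format described in this simulated response (each rollout's query sequence followed by its final outcome) could look roughly like the sketch below. The field names `queries` and `success`, the text labels, and the example rollouts are hypothetical; the paper's actual serialization is not given in the material reviewed here.

```python
def build_hindsight_block(rollouts):
    """Serialize rollouts as per-rollout query sequences followed by the outcome.

    Assumption: each rollout is a dict with a list of query strings under
    "queries" and a boolean under "success".
    """
    lines = []
    for i, rollout in enumerate(rollouts, start=1):
        lines.append(f"Rollout {i}:")
        for j, query in enumerate(rollout["queries"], start=1):
            lines.append(f"  query {j}: {query}")
        outcome = "success" if rollout["success"] else "failure"
        lines.append(f"  outcome: {outcome}")
    return "\n".join(lines)

# Hypothetical group of rollouts sampled for the same question.
rollouts = [
    {"queries": ["first sub-question", "first sub-question again"], "success": False},
    {"queries": ["first sub-question", "follow-up on retrieved entity"], "success": True},
]

block = build_hindsight_block(rollouts)
print(block)
```

Note that in this layout the outcome is attached to a rollout, not to an individual query. Whether that grouping is enough to separate good queries from bad ones within a rollout is exactly what the referee's major comment asks.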

Circularity Check

0 steps flagged

No significant circularity; self-contained on-policy training procedure

full rationale

The paper's central claim is that SD-Search generates step-level supervision internally by having the same policy act as teacher (conditioned on a hindsight block of sampled rollouts and outcomes) and student (inference-time context only), with the student trained to match the teacher's query distribution via token-level JSD on top of GRPO trajectory rewards. This construction does not reduce to a tautology or fitted input by definition: the hindsight summary supplies outcome information unavailable to the student at inference time, and the JSD objective is an explicit additional loss rather than a renaming or self-referential fit. No equations or sections in the provided text exhibit self-definitional equivalence, load-bearing self-citations, or imported uniqueness theorems. The method is presented as an empirical augmentation within the standard RL loop, with effectiveness depending on whether the compact block preserves per-query signals—an assumption that is falsifiable rather than definitional.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review provides insufficient detail to enumerate free parameters or background axioms exhaustively; the approach rests on standard RL outcome-reward assumptions and introduces the hindsight block as a new conditioning mechanism.

axioms (1)
  • domain assumption: Outcome-reward reinforcement learning assigns the same trajectory-level reward to every search decision within a rollout.
    Stated as the starting problem that SD-Search addresses.
invented entities (1)
  • hindsight block (no independent evidence)
    purpose: Compact summary of search queries and final outcomes from a group of rollouts used to condition the teacher.
    Introduced to enable the teacher to mark worthwhile decisions without external input.

pith-pipeline@v0.9.0 · 5830 in / 1383 out tokens · 63082 ms · 2026-05-20T09:55:55.071568+00:00 · methodology
