SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Ben Chen; Chenyi Lei; Huangyu Dai; Lingtao Mao; Wenwu Ou; Xuxin Zhang; Yufei Ma; Zhipeng Qian; Zihan Liang

arxiv: 2605.18299 · v1 · pith:4HLLBK5Znew · submitted 2026-05-18 · 💻 cs.AI · cs.CL · cs.IR

Pith reviewed 2026-05-20 09:55 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR

keywords search-augmented reasoning · self-distillation · hindsight learning · on-policy training · Jensen-Shannon divergence · reinforcement learning · query generation

The pith

A single model can create its own step-level supervision for search queries by distilling from a hindsight-aware version of itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Search-augmented reasoning agents suffer from poor credit assignment because every search query in a rollout receives only the same final outcome reward. SD-Search lets the model itself generate denser signals by running rollouts, packing their queries and results into a compact hindsight block, and using that block to condition a teacher version of the model. The inference-time student version is then trained to match the teacher's query distribution at search positions through token-level Jensen-Shannon divergence. This process runs inside the normal reinforcement learning loop and needs no larger external model or extra human annotations.

Core claim

The central claim is that on-policy hindsight self-distillation supplies effective step-level supervision for query decisions: a single model plays the role of teacher when additionally conditioned on a compact hindsight block that summarizes search queries and final outcomes across sampled rollouts from the same question, and the student is trained to recover the teacher's query distribution by minimizing token-level Jensen-Shannon divergence at search-query positions, thereby layering a dense signal atop trajectory-level rewards such as those from GRPO.
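
As a rough illustration of the distribution-matching term described above, the sketch below computes a mean Jensen-Shannon divergence between student and teacher next-token distributions, restricted to search-query positions. It is a minimal pure-Python sketch, not the authors' implementation: the function names (`jsd`, `query_jsd_loss`), the equal-weight form of the divergence, the mean reduction, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Equal-weight Jensen-Shannon divergence between two distributions."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def query_jsd_loss(student_logits, teacher_logits, query_mask):
    """Mean token-level JSD over positions flagged as search-query tokens.

    student_logits / teacher_logits: one list of vocabulary logits per position.
    query_mask: 1 where the token belongs to a search query, else 0 (assumed layout).
    """
    total, count = 0.0, 0
    for s, t, keep in zip(student_logits, teacher_logits, query_mask):
        if keep:
            total += jsd(softmax(s), softmax(t))
            count += 1
    return total / count if count else 0.0

# Toy example: three positions, vocabulary of size 3 (values are made up).
student_logits = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
teacher_logits = [[0.0, 2.0, -1.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
query_mask = [1, 1, 0]  # the third position is not part of a query and is ignored
loss = query_jsd_loss(student_logits, teacher_logits, query_mask)
```

In a real training loop the two logit sequences would come from the same model run without and with the hindsight block. Whether gradients flow through the teacher branch is not stated in this summary, so the sketch leaves that question open.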

What carries the argument

The hindsight block, a compact summary of search queries and final outcomes from sampled rollouts, which lets the teacher implicitly mark useful decisions so the student can learn them through distribution matching.
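
To make the idea of a hindsight block concrete, the sketch below serializes a group of rollouts into a text block listing each rollout's queries followed by its final outcome, then prepends that block to the student context to form a teacher input. Everything about the format is hypothetical: the `Rollout` class, the `[HINDSIGHT]` delimiters, the success/failure labels, and the placement of the block before the student context are assumptions for illustration, not the layout used in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    """One sampled trajectory for a question (hypothetical container)."""
    queries: List[str]   # search queries issued, in order
    success: bool        # final outcome of the whole trajectory

def build_hindsight_block(rollouts: List[Rollout]) -> str:
    """Serialize per-rollout queries and outcomes into one compact text block."""
    lines = ["[HINDSIGHT]"]
    for i, rollout in enumerate(rollouts, start=1):
        lines.append(f"Rollout {i}:")
        for j, query in enumerate(rollout.queries, start=1):
            lines.append(f"  query {j}: {query}")
        outcome = "success" if rollout.success else "failure"
        lines.append(f"  outcome: {outcome}")
    lines.append("[/HINDSIGHT]")
    return "\n".join(lines)

def teacher_input(student_context: str, rollouts: List[Rollout]) -> str:
    """Teacher view = hindsight block + the same context the student sees."""
    return build_hindsight_block(rollouts) + "\n" + student_context

# Toy group of two rollouts for one question (contents are made up).
group = [
    Rollout(queries=["capital of France", "population of Paris"], success=True),
    Rollout(queries=["France", "France"], success=False),
]
student_context = "Question: What is the population of the capital of France?"
block = build_hindsight_block(group)
teacher_context = teacher_input(student_context, group)
```

The student and teacher inputs differ only in the presence of the block, which mirrors the "two roles that differ only in conditioning" description; the block is dropped at inference time.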

If this is right

The approach adds dense step-level signals to the coarse trajectory reward used in reinforcement learning for search-augmented agents.
Training stays inside the standard RL loop and requires no separate annotation pipeline or external model calls.
Query quality improves because the student learns to imitate the decisions that the hindsight-aware teacher associates with successful outcomes.
The method removes dependence on larger teacher models or externally generated sub-question labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-distillation pattern could be tested on other sequential decision problems where credit assignment is difficult and hindsight summaries are easy to form.
Varying the size or content of the hindsight block might reveal how much outcome information is needed to produce useful query distributions.
If the divergence signal proves reliable, it could be combined with other forms of process supervision without increasing model count.

Load-bearing premise

The hindsight block must contain enough information about which rollouts succeeded that matching the teacher's query distribution actually improves the student's later decisions when the block is removed.

What would settle it

An ablation in which removing the hindsight block or replacing the Jensen-Shannon term with uniform random targets produces no gain in final task accuracy or query quality compared with standard trajectory-only reinforcement learning.

Figures

Figures reproduced from arXiv: 2605.18299 by Ben Chen, Chenyi Lei, Huangyu Dai, Lingtao Mao, Wenwu Ou, Xuxin Zhang, Yufei Ma, Zhipeng Qian, Zihan Liang.

**Figure 2.** Method overview of SD-Search. The policy acts as its own teacher by conditioning on a …

**Figure 3.** Hindsight block construction ablation on Qwen2.5-3B-Base. Eight configurations cluster …

**Figure 4.** Sensitivity of hyperparameters on Qwen2.5-3B-Base. Each panel varies one hyperparameter …

**Figure 5.** Training dynamics of SD-Search versus AutoRefine on Qwen2.5-3B-Base, averaged …

**Figure 6.** Per-token trace of P(actual token | condition) across the action span of a failed Bamboogle rollout, in which the focal trajectory issues two near-identical queries, a degenerate failure mode that hindsight conditioning is designed to penalize. Student (blue) sees no hindsight; teacher with real outcomes (green) is the SD-Search teacher view; teacher with flipped outcomes (purple) is a control in which eve…

**Figure 7.** Scaling on Qwen2.5 at 1.5B, 3B, 7B, 14B (average EM across the seven benchmarks at …

**Figure 8.** Teacher input layout for one supervised step. Blocks (1) and (5) are the standard student …

Original abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SD-Search, a method for search-augmented reasoning agents, which interleave internal reasoning with external retrieval. It addresses the lack of step-level credit in outcome-reward RL (e.g., GRPO) by introducing on-policy hindsight self-distillation: a single model serves as a student (inference-time context only) and a teacher (additionally conditioned on a compact hindsight block of sampled rollouts' queries and outcomes). The student is trained to match the teacher's query distribution at search positions via token-level Jensen-Shannon divergence, layering dense supervision on top of trajectory rewards without external teachers, annotations, or extra training stages.

Significance. If the mechanism is effective, the work provides a self-contained way to derive process-level signals for query generation in RL-trained agents, reducing dependence on larger teachers or manual sub-question labels. The on-policy nature, use of a single model in dual roles, and token-level JSD objective are strengths that could improve sample efficiency and performance on multi-step reasoning tasks with retrieval.

major comments (1)

[Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.

minor comments (1)

[Abstract] Notation for the Jensen-Shannon divergence should be defined explicitly (including the token-level formulation) rather than left as 'token-level Jensen-Shannon divergence'.
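
For reference, one standard way to write such a token-level objective is sketched below. The notation is an assumption introduced here, not taken from the paper: \(\pi_\theta(\cdot \mid c_t)\) denotes the student's next-token distribution at position \(t\) given the inference-time context \(c_t\), \(\pi_\theta(\cdot \mid c_t, h)\) denotes the teacher's distribution when the same model is additionally conditioned on the hindsight block \(h\), and \(Q\) is the set of search-query token positions. The paper may use a different weighting (for example a generalized, asymmetric JSD) or a different reduction.

```latex
\[
m_t = \tfrac{1}{2}\bigl(\pi_\theta(\cdot \mid c_t) + \pi_\theta(\cdot \mid c_t, h)\bigr)
\]
\[
\mathrm{JSD}_t
  = \tfrac{1}{2}\,\mathrm{KL}\bigl(\pi_\theta(\cdot \mid c_t)\,\|\,m_t\bigr)
  + \tfrac{1}{2}\,\mathrm{KL}\bigl(\pi_\theta(\cdot \mid c_t, h)\,\|\,m_t\bigr)
\]
\[
\mathcal{L}_{\mathrm{SD}} = \frac{1}{|Q|}\sum_{t \in Q} \mathrm{JSD}_t
\]
```

With natural logarithms each \(\mathrm{JSD}_t\) lies in \([0, \ln 2]\) and equals zero exactly when the student and teacher distributions coincide at position \(t\).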

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We have carefully considered the major comment regarding the description of the hindsight block and provide our response below. We believe the clarification will strengthen the paper.

Referee: [Abstract] Abstract and method description: the central claim that 'the teacher knows how each rollout unfolded and which ones succeeded, [so] its query distribution implicitly marks which decisions were worth making' depends on the hindsight block preserving distinguishable per-query signals. The description of the block as a 'compact hindsight block summarizing the search queries and final outcomes of a group of rollouts' does not specify the linking mechanism (e.g., whether outcomes are explicitly attributed to individual queries or whether the block is a flat concatenation). Without such linkage, the resulting JSD target supplies no finer credit than the shared trajectory reward, undermining the step-level supervision argument.

Authors: We thank the referee for highlighting this important point about the clarity of our description. The hindsight block is constructed by appending, for each sampled rollout, the sequence of search queries issued during that rollout followed by the final trajectory outcome (e.g., success or failure indicator). This structure explicitly links individual queries to their outcomes within the block, enabling the teacher model to distinguish the contribution of specific queries to successful rollouts. The student, lacking this block, learns to produce query distributions that align with those informed by outcome knowledge. We acknowledge that the abstract's phrasing is concise and could be more explicit about this per-query attribution. We will revise the abstract and the method section to include a more detailed description of the hindsight block's format, such as 'a compact hindsight block that concatenates per-rollout query sequences with their corresponding final outcomes'. This should address the concern and make the step-level supervision mechanism clearer.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; self-contained on-policy training procedure

The paper's central claim is that SD-Search generates step-level supervision internally by having the same policy act as teacher (conditioned on a hindsight block of sampled rollouts and outcomes) and student (inference-time context only), with the student trained to match the teacher's query distribution via token-level JSD on top of GRPO trajectory rewards. This construction does not reduce to a tautology or fitted input by definition: the hindsight summary supplies outcome information unavailable to the student at inference time, and the JSD objective is an explicit additional loss rather than a renaming or self-referential fit. No equations or sections in the provided text exhibit self-definitional equivalence, load-bearing self-citations, or imported uniqueness theorems. The method is presented as an empirical augmentation within the standard RL loop, with effectiveness depending on whether the compact block preserves per-query signals—an assumption that is falsifiable rather than definitional.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review provides insufficient detail to enumerate free parameters or background axioms exhaustively; the approach rests on standard RL outcome-reward assumptions and introduces the hindsight block as a new conditioning mechanism.

axioms (1)

domain assumption: Outcome-reward reinforcement learning assigns the same trajectory-level reward to every search decision within a rollout.
Stated as the starting problem that SD-Search addresses.
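
The credit-assignment problem behind this assumption can be seen in a small sketch of group-relative advantages in the style of GRPO: each rollout's scalar reward is normalized against its group, and the resulting single number is shared by every token in that rollout, including every search query. This is an illustrative sketch only; the function names, the use of the population standard deviation, and the `eps` constant are assumptions, and real implementations differ in these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's outcome reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Every token of a rollout receives the same trajectory-level advantage."""
    return [advantage] * num_tokens

# Toy group of four rollouts for one question with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A rollout with 5 tokens, two of which might belong to different search queries:
# both queries get exactly the same credit, regardless of which one helped.
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=5)
```

Because `token_advantages` is constant within a rollout, a helpful query and a redundant one in the same trajectory are indistinguishable to the policy gradient, which is the gap the step-level distillation signal is meant to fill.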

invented entities (1)

hindsight block (no independent evidence)
purpose: Compact summary of search queries and final outcomes from a group of rollouts used to condition the teacher.
Introduced to enable the teacher to mark worthwhile decisions without external input.

pith-pipeline@v0.9.0 · 5830 in / 1383 out tokens · 63082 ms · 2026-05-20T09:55:55.071568+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "on-policy hindsight self-distillation... minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "compact hindsight block summarizing the search queries and final outcomes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, pages 21246–21263, 2024

work page 2024

[2] [2]

Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Ori...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for llms via reinforcement learning.arXiv preprint arXiv:2503.19470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Group-in-group policy optimization for LLM agent training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[6] [6]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InInternational Conference on Learning Representations, pages 32694– 32717, 2024

work page 2024

[7] [7]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, and Wu. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025

[8] [8]

Hindsight credit assignment

Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, and Remi Munos. Hindsight credit assignment. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2019

work page 2019

[9] [9]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

work page 2020

[11] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025.

[12] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, 2017.

[13] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transact...

[14] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401, 2021.

[15] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, Suzhou, China, 2025.

[16] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2016.

[17] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada, 2023.

[18] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2024.

[19] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore, 2023.

[20] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, pages 68539–68551, 2023.

[21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[22] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[23] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026.

[24] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pages 1279–1297, 2025.

[25] Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[26] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.

[27] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.

[28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, Toronto, Canada, 2023.

[29] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2024.

[30] Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025.

[31] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, pages 24824–24837, 2022.

[32] Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, and Hannaneh Hajishirzi. Meta-reinforcement learning with self-reflection for agentic search. arXiv preprint arXiv:2603.11327, 2026.

[33] Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, and Jun Zhou. Thinker: Training LLMs in hierarchical thinking for deep search via multi-turn interaction. arXiv preprint arXiv:2511.07943, 2025.

[34] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, 2018.

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[36] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

A Related Work

Search-augmented reasoning with reinforcement learning. A line of recent work trains LLMs to invoke retrieval tools during reasoni...