Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Ben Chen; Huangyu Dai; Lingtao Mao; Xuxin Zhang; Yufei Ma; Zhipeng Qian; Zihan Liang

arXiv: 2605.22511 · v1 · pith:CF5QLFCR · submitted 2026-05-21 · cs.AI · cs.CL · cs.IR

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

Pith reviewed 2026-05-22 06:01 UTC · model grok-4.3

classification: cs.AI, cs.CL, cs.IR

keywords: self-distillation, search-augmented reasoning, GRPO, self-evolution, question answering, language model post-training, offline distillation, reasoning agents

The pith

Self-distillation after GRPO lets search-augmented models evolve using only their own better trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that recent elaborations on post-training for search-augmented reasoning agents may be unnecessary. A loop of vanilla GRPO followed by offline self-distillation is enough: after each GRPO round the model generates its own rollouts on training questions, then applies a token-level forward KL loss that pulls its normal inference distribution toward the distribution it produces when given a privileged context exposing a more efficient sibling trajectory. This supplies dense per-step supervision from internal data alone. On seven QA benchmarks the Qwen2.5-3B model reaches 0.440 average exact match (EM), and the abstract states that Search-E1 surpasses all open-source baselines at both model scales it evaluates. A reader would care because the result suggests that much of the added machinery in current agent recipes can be stripped away while still obtaining strong performance gains.

Core claim

Search-E1 interleaves vanilla GRPO with offline self-distillation. In the distillation step the policy is aligned via token-level forward KL to its own output distribution under a privileged context that reveals a more efficient sibling trajectory. The procedure runs without external supervision, auxiliary reward models, tree search, or hand-crafted bonuses, yet naturally yields dense per-step signals and produces a Qwen2.5-3B model that attains 0.440 average EM across seven QA benchmarks, surpassing all open-source baselines.

What carries the argument

Offline self-distillation (OFSD) that aligns the policy's inference distribution to its own distribution under a privileged context containing a more efficient sibling trajectory, using token-level forward KL.
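To make the token-level forward KL concrete, here is a minimal sketch in plain Python. It assumes that "forward KL" means KL(teacher || student) computed at each token position of the student trajectory and then averaged. The paper's exact reduction, masking, and weighting are not given in the abstract, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at one position (direction is an assumption)."""
    return sum(
        p * (math.log(p) - math.log(max(q, eps)))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

def token_level_forward_kl(teacher_logits, student_logits):
    """Mean per-position KL, plus the per-position values (the dense signal)."""
    per_token = [
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy example: two positions, vocabulary of size 3.
# The teacher (privileged context) is more confident at position 0 only.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss, per_token = token_level_forward_kl(teacher_logits, student_logits)
```

In this toy case only position 0 contributes to the loss, which illustrates how a per-token objective localizes the training signal to the steps where the privileged and unprivileged distributions disagree.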

If this is right

Vanilla GRPO plus internal self-distillation suffices for competitive search-augmented reasoning performance.
Dense per-step supervision arises automatically from aligning to better internal trajectories.
The approach works at both 3B and larger scales and exceeds prior open-source baselines on seven QA tasks.
No external models, process reward modules, or hand-crafted reward terms are required for the observed gains.
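For contrast with the dense signal claimed above, the following sketch shows the group-relative advantage commonly associated with GRPO when the reward is a single 0/1 outcome per trajectory. It assumes normalization by the group mean and population standard deviation. The paper's exact GRPO configuration is not stated in the abstract, and the function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """One scalar advantage per trajectory, normalized within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one question, scored by exact match (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every rollout gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Every token in a trajectory shares that trajectory's single advantage value, which is why an outcome-only reward is sparse compared with a per-token distillation loss.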

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method effectively creates an internal curriculum by repeatedly distilling toward shorter or higher-quality trajectories found in the same rollout batch.
This self-evolution loop could be repeated for many cycles to test whether gains continue or plateau without external data.
The reliance on privileged sibling trajectories suggests the technique may transfer to other sequential tasks where better paths can be identified within a single generation batch.
By eliminating dependence on stronger external systems the recipe lowers the resource threshold for training capable reasoning agents.

Load-bearing premise

The privileged context must expose a trajectory that is sufficiently independent of the current policy to supply genuine new improvement signals rather than simply reinforcing what the model already knows.

What would settle it

Replace the privileged context with the model's standard unprivileged rollouts and check whether the combined GRPO-plus-distillation loop still produces gains beyond GRPO alone.

Figures

Figures reproduced from arXiv: 2605.22511 by Ben Chen, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Yufei Ma, Zhipeng Qian, Zihan Liang.

**Figure 1.** Overview of Search-E1. Top: a GRPO round with exact-match outcome reward. Bottom: an OFSD round in which the student conditions on $q + \tau_{\text{stu}}$ and the teacher on $q + \tau_{\text{ref}} + \tau_{\text{stu}}$, aligned by a token-level forward KL.

… supervision from a stronger system, either by distilling sub-question decompositions from a 72B teacher (Xu et al., 2025) or by deriving step-wise rewards from GPT-4o annotations (Wang et al., 202…

**Figure 1.** Pair mining. After a GRPO round converges, we sample the policy on its training questions to obtain a fresh rollout pool: for each question $q$, we draw $K$ trajectories $\{\tau_q^{(1)}, \dots, \tau_q^{(K)}\}$ from the converged policy, each annotated with its outcome reward $R \in \{0, 1\}$ and the number of retrieval calls $n_{\text{srch}}$. For each question we then construct a pair $(\tau_{\text{ref}}, \tau_{\text{stu}})$. The reference $\tau_{\text{ref}}$ is the correct traject…
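The pair-mining text above is truncated, so the selection rule cannot be read off this page. The sketch below is one plausible reading, under two loudly stated assumptions: the reference is the correct trajectory with the fewest retrieval calls, and the student is the sibling with the most retrieval calls among those that use more than the reference. The `Trajectory` class and `mine_pair` function are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Trajectory:
    text: str
    reward: int    # outcome reward R in {0, 1}
    n_search: int  # number of retrieval calls

def mine_pair(rollouts: List[Trajectory]) -> Optional[Tuple[Trajectory, Trajectory]]:
    """Return (reference, student) for one question, or None if no pair exists."""
    correct = [t for t in rollouts if t.reward == 1]
    if not correct:
        return None  # no correct sibling to serve as reference
    # Assumption: the most efficient correct trajectory is the reference.
    ref = min(correct, key=lambda t: t.n_search)
    # Assumption: the student is a less efficient sibling (more retrieval calls).
    candidates = [t for t in rollouts if t is not ref and t.n_search > ref.n_search]
    if not candidates:
        return None
    stu = max(candidates, key=lambda t: t.n_search)
    return ref, stu

pool = [
    Trajectory("a", reward=1, n_search=3),
    Trajectory("b", reward=1, n_search=1),
    Trajectory("c", reward=0, n_search=4),
    Trajectory("d", reward=0, n_search=1),
]
pair = mine_pair(pool)
```

Under these assumptions the toy pool yields trajectory "b" as reference and "c" as student. Questions with no correct rollout are skipped, which keeps the distillation target tied to the outcome reward.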

Original abstract

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation (OFSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches $0.440$ average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.
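As a reading aid for the headline number, here is a minimal sketch of how an exact-match score and its average across benchmarks are often computed. It assumes SQuAD-style answer normalization and an unweighted mean over benchmarks. The paper's evaluation protocol is not described in the abstract, and the benchmark names and scores below are placeholders, not reported results.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def average_em(per_benchmark_em):
    """Unweighted mean over benchmarks (an assumption about the averaging)."""
    return sum(per_benchmark_em.values()) / len(per_benchmark_em)

# Placeholder values for illustration only.
toy_scores = {"benchmark_a": 0.5, "benchmark_b": 0.25}
toy_average = average_em(toy_scores)
hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
```

Whether the reported 0.440 is an unweighted mean over the seven benchmarks or weighted by question count cannot be determined from this page.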

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Search-E1 claims a minimal GRPO-plus-self-distillation loop can beat open-source baselines on search-augmented QA, but the abstract gives almost no detail on how the privileged sibling trajectories are actually generated.

The main point to take away is that the paper tries to strip search-augmented reasoning back to just vanilla GRPO interleaved with token-level forward KL distillation onto the model's own outputs under some privileged context. They report this reaches 0.440 average EM across seven QA benchmarks with Qwen2.5-3B and beats the open-source baselines they list. If the numbers hold, it would support the idea that a lot of the extra modules and external supervision in recent work are not required. That is the cleanest part of the pitch: it keeps the pipeline short and still claims dense per-step signals from the distillation step. The interleaving itself and the specific use of sibling trajectories under privileged context do not appear in the prior work cited in the abstract, so the recipe has a concrete difference from what has been published so far. The authors also avoid overclaiming the method as revolutionary; they simply ask whether the added complexity is necessary and present a simpler alternative. That restraint is useful.

The soft spots are exactly where the stress-test note points. The abstract never says how the privileged context is built (no mention of extra search budget, oracle filtering, or an external model), so it is impossible to judge whether the sibling trajectory supplies signals that are independent of the current policy or whether it is mostly another high-probability sample from the same distribution. If the latter, the KL step risks becoming self-reinforcement rather than genuine evolution. The performance claim also sits on very thin evidence: no baseline descriptions, no ablations, no statistical tests, and no experimental protocol. Without those, the 0.440 number cannot be evaluated. The circularity concern is therefore load-bearing and currently unaddressed.

This paper is aimed at researchers who train search-augmented agents and want lighter recipes. Someone already running GRPO-style loops could get practical value from trying the OFSD addition once the code is released. It is worth sending to peer review because the core claim is simple enough to test and the results, if reproducible, would matter to the field, even though the experimental section will need substantial work to make the method verifiable.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper describes a training procedure (GRPO interleaved with OFSD) that aligns the policy to its own distribution under a privileged context derived from its rollouts. The central performance claim (0.440 average EM on seven QA benchmarks, surpassing open-source baselines) is an empirical measurement against external test sets, not a mathematical derivation or fitted quantity that reduces to the input distribution by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the method. The procedure is self-contained; any concern about whether the privileged context supplies decorrelated signals is a question of empirical effectiveness rather than definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that privileged sibling trajectories generated during search provide an independent and superior target distribution for distillation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: The privileged context exposes a more efficient sibling trajectory that serves as a reliable target for improving the policy.
Rationale: This premise is required for the self-distillation step to produce genuine gains rather than circular self-alignment.

pith-pipeline@v0.9.0 · 5800 in / 1270 out tokens · 47457 ms · 2026-05-22T06:01:51.583793+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 15 internal anchors

[1] Improving language models by retrieving from trillions of tokens. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, O...

[2] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. arXiv preprint arXiv:2503.19470.

[3] Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models. Zheng Ding and Weirui Ye. arXiv preprint arXiv:2512.08153.

[4] Dynasearcher: Dynamic knowledge graph augmented search agent via multi-reward reinforcement learning. Chuzhan Hao, Wenfeng Feng, Yuewei Zhang, and Hao Wang. arXiv preprint arXiv:2507.17365.

[5] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. arXiv preprint arXiv:2005.11401.

[6] Search-o1: Agentic search-enhanced large reasoning models. Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5420–5438, Suzhou, China.

[7] Let's Verify Step by Step. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. arXiv preprint arXiv:2305.20050.

[8] Unifying distillation and privileged information. David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. arXiv preprint arXiv:1511.03643.

[9] Qarm: Quantitative alignment multi-modal recommendation at kuaishou. Xinchen Luo, Jiangxia Cao, Tianyu Sun, Jinkai Yu, Rui Huang, Wei Yuan, Hezheng Lin, Yichen Zheng, Shiyao Wang, Qigen Hu, et al. arXiv preprint arXiv:2411.11739.

[10] GPT-4 Technical Report. OpenAI. arXiv preprint arXiv:2303.08774, 2024a. OpenAI. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024b. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2...

[11] Qwen2.5 Technical Report. Qwen. arXiv preprint arXiv:2412.15115.

[12] Deepseekmath: Pushing the limits of mathematical reasoning in open language models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. arXiv preprint arXiv:2402.03300.

[13] Self-Distillation Enables Continual Learning. Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. arXiv preprint arXiv:2601.19897.

[14] Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. arXiv preprint arXiv:2505.01441.

[15] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. arXiv preprint arXiv:2503.05592.

[16] ZeroSearch: Incentivize the Search Capability of LLMs without Searching. Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. arXiv preprint arXiv:2505.04588.

[17] Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. arXiv preprint arXiv:2212.03533.

[18] Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. arXiv preprint arXiv:2505.15107.

[19] Search-p1: Path-centric reward shaping for stable and efficient agentic rag training. Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, and Jie Jiang. arXiv preprint arXiv:2602.22576.

[20] Thinker: Training llms in hierarchical thinking for deep search via multi-turn interaction. Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, and Jun Zhou. arXiv preprint arXiv:2511.07943.

[21] Qwen3 Technical Report. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. arXiv preprint arXiv:2505.09388.

[22] HotpotQA: A dataset for diverse, explainable multi-hop question answering. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, Brussels, Belgium.

[23] ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. arXiv preprint arXiv:2210.03629.

[24] Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic. Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao. arXiv preprint arXiv:2511.12159, 2025a. Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The ...
