arxiv: 2604.09455 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: 2 theorem links

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Weiyang Guo , Zesheng Shi , Liye Zhao , Jiayuan Ma , Zeen Zhu , Junxian He , Min Zhang , Jing Li

Pith reviewed 2026-05-10 18:03 UTC · model grok-4.3

classification 💻 cs.AI

keywords tool-integrated reasoning · LLM agent training · experience exploitation · policy optimization · synthetic data efficiency · reinforcement learning · tool-use tasks · warm-up paradigm

The pith

E3-TIR integrates expert anchors with self-exploration branches to raise tool-use performance in LLMs by 6 percent while using less than 10 percent of the synthetic data that prior pipelines need.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix two main problems in training large language models for tool-integrated reasoning. Pure reinforcement learning explores too little and collapses into narrow behaviors without guidance, while first doing supervised fine-tuning on lots of data and then applying reinforcement learning hits performance ceilings and costs too much to generate the data. E3-TIR treats early training as a mixture of three experience sources: expert prefixes that give strong starting points, expert-guided paths that stay close to good examples, and free self-exploration that branches out from those anchors. A mixed policy optimization step then trains on all three at once, which the authors claim reduces the mismatch between the data the model sees and the tasks it must solve. If this holds, developers could build capable tool-using agents with far smaller synthetic datasets and shorter training runs, raising the overall return on the compute invested.

Core claim

E3-TIR formulates the early stages of agent training as the dynamic integration of Expert Prefixes, Expert Guided, and Self-Exploration experiences. By executing diverse branching exploration around expert anchors and employing a mix policy optimization mechanism, the approach mitigates distribution shifts and resolves optimization conflicts arising from shared prefixes, allowing the model to adapt its knowledge boundaries while balancing exploration diversity with training efficiency.

What carries the argument

Mix policy optimization over three experience types anchored at expert prefixes, with branching self-exploration that shares prefixes yet diverges later.
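
The branching pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Rollout`, `branch_from_anchors`, `toy_policy`, the anchor positions, and the mapping of an empty prefix to "self-exploration" are all assumptions made for the example. It shows only the data shape: several continuations forked from the same expert prefix, so rollouts share leading steps and diverge afterwards.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    steps: List[str]   # shared expert prefix followed by a sampled continuation
    prefix_len: int    # how many leading steps were copied from the expert
    source: str        # hypothetical label: "expert_guided" or "self_exploration"


def branch_from_anchors(
    expert: Sequence[str],
    anchors: Sequence[int],
    sample_continuation: Callable[[List[str]], List[str]],
    branches_per_anchor: int = 2,
) -> List[Rollout]:
    """Fork several continuations from each anchor position of one expert trajectory."""
    rollouts = []
    for k in anchors:
        if not 0 <= k <= len(expert):
            raise ValueError(f"anchor {k} is outside the expert trajectory")
        prefix = list(expert[:k])
        # Assumption: an empty prefix is treated as pure self-exploration.
        source = "self_exploration" if k == 0 else "expert_guided"
        for _ in range(branches_per_anchor):
            continuation = sample_continuation(prefix)
            rollouts.append(Rollout(prefix + continuation, k, source))
    return rollouts


def toy_policy(prefix: List[str]) -> List[str]:
    """Stand-in for sampling from the current model; returns two random steps."""
    return [f"explore_{random.randint(0, 9)}" for _ in range(2)]


expert_trajectory = ["plan", "call_search", "read_result", "answer"]
rollouts = branch_from_anchors(
    expert_trajectory, anchors=[0, 1, 3], sample_continuation=toy_policy
)
for r in rollouts:
    print(r.source, r.prefix_len, r.steps)
```

Rollouts that fork from the same anchor carry identical leading steps, which is the shared-prefix situation the paper says its optimizer has to handle.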

If this is right

Tool-use benchmarks show a 6 percent performance lift over both zero-reinforcement-learning and supervised-fine-tuning-then-reinforcement-learning baselines.
Synthetic data volume drops below 10 percent of what prior pipelines require while still reaching higher final capability.
A composite ROI metric that folds together accuracy, data cost, and wall-clock efficiency improves by a factor of 1.46 relative to the same baselines.
Early-stage training becomes more stable because shared prefixes no longer force the optimizer into conflicting gradients.
The model continues to expand its effective knowledge boundary rather than collapsing to low-entropy outputs.
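
The last point concerns low-entropy collapse. One common way to watch for it, offered here as a generic diagnostic and not as the paper's own measurement, is the mean Shannon entropy of the policy's next-token distributions. The function names and the toy distributions below are assumptions for illustration.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions: List[List[float]]) -> float:
    """Average entropy over the decoding steps of one rollout."""
    if not step_distributions:
        return 0.0
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


# Toy data: a policy that still spreads probability mass vs. a collapsed one.
diverse = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01]]

print("diverse:", mean_entropy(diverse))
print("collapsed:", mean_entropy(collapsed))
```

A value that keeps shrinking toward zero over training is the usual signature of the collapse the paper attributes to SFT-then-RL.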

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring-plus-branching pattern could be tested on other long-horizon agent domains such as web navigation or code repair where data generation is expensive.
If the method scales, organizations could lower the carbon and dollar cost of producing specialized tool-using models by roughly an order of magnitude.
Future experiments might measure whether the same three-way experience mix prevents mode collapse when training runs extend to thousands of steps rather than the early-stage regime studied here.
The approach suggests that explicit control of prefix sharing may be a general lever for stabilizing reinforcement learning on any autoregressive model that must reuse earlier tokens.

Load-bearing premise

Branching exploration around expert anchors plus mix policy optimization will reliably reduce distribution shifts and prefix conflicts without creating new instabilities or demanding heavy extra tuning.

What would settle it

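An ablation on the same tool-use benchmarks would settle it. If removing the branching step or the mixed optimization leaves performance and data requirements essentially unchanged, the gains cannot be attributed to those components and the central claim is falsified. If instead performance drops back to baseline levels, or data requirements rise above 10 percent of the original volume, the claim is supported.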

Figures

Figures reproduced from arXiv: 2604.09455 by Jiayuan Ma, Jing Li, Junxian He, Liye Zhao, Min Zhang, Weiyang Guo, Zeen Zhu, Zesheng Shi.

**Figure 2.** Statistical analysis of the limitations of current training paradigms.

**Figure 3.** Illustration of our E3-TIR framework. (a) Branching exploration from expert anchors and dynamic experience filtering. (b) Hybrid advantage estimation and advantage-aware gradient detachment.

**Figure 4.** Distribution of Standard Deviation and En…

**Figure 5.** Comparison of the training curves on Qwen2.5-3B-Instruct.

**Figure 7.** Ablation analysis of clip ratio and gradient…

**Figure 8.** Comparison of Code Failed Rate and Solve…

**Figure 9.** (a) Cost-Benefit Heatmap during the Warm-up…

**Figure 10.** Variation of the sample size and rewards in the preheating stage of the mixed strategy.

Original abstract

While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert "anchors" and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model's knowledge boundaries, effectively balancing exploration diversity with training efficiency. Experimental results demonstrate that E3-TIR achieves a 6 performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10 of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

E3-TIR mixes expert-anchored branching with self-exploration and a mix policy for more efficient early-stage TIR training, but the 6 percent gain and the ROI numbers sit on unshown experiments.

The main point to know is that E3-TIR proposes a warm-up training method for tool-integrated reasoning in LLMs that mixes expert-anchored experiences with self-exploration through branching and a mix policy optimizer, cutting data needs while boosting performance. What is new here is the concrete combination of those three experience types with dynamic branching around expert anchors and the mix policy to deal with shared-prefix conflicts and distribution shifts. It takes familiar RL and SFT ideas and applies them in this early training stage to avoid the usual pitfalls, such as inefficient exploration or entropy collapse.

The paper does well at spelling out why Zero-RL and standard SFT-then-RL fall short on tool-use tasks, and the approach seems like a reasonable way to get better ROI on synthetic data.

The soft spots are mostly around the evidence. The abstract claims a 6 percent performance lift with less than 10 percent of the data and a 1.46x ROI gain, but it gives no baselines, no stats, no ablations, and no details on how the mix policy is implemented or whether it introduces tuning issues. The assumption that branching plus mixing reliably avoids new problems is stated but not shown in the provided text. If the full paper has the code and a results section, that would change things, but right now the claims are hard to assess.

This paper is for people working on training efficient LLM agents for tool use, particularly those looking at data-efficient RL methods. A practitioner or researcher in that area could pick up the paradigm and try it out, especially since code is linked. It deserves a serious referee because the core idea is grounded in real training challenges and the method is described clearly enough to be reproducible if the details check out. I would recommend putting it through peer review rather than desk rejecting it, but flag the need for stronger empirical validation in the reviews.

Referee Report

2 major / 2 minor

Summary. The paper proposes E3-TIR, a warm-up training paradigm for tool-integrated reasoning in LLMs. It addresses limitations of Zero-RL (inefficient exploration and mode degradation) and SFT-then-RL (high data costs and low-entropy collapse) by dynamically integrating three experience types (Expert Prefixes, Expert Guided, and Self-Exploration) via diverse branching exploration around expert anchors combined with a mix policy optimization mechanism. The central empirical claims are a 6 percent performance improvement on tool-use tasks using less than 10 percent of the synthetic data required by baselines, plus a 1.46x ROI gain (integrating performance, data cost, and training efficiency).

Significance. If the reported gains are substantiated with full experimental controls, this could offer a practical advance in data-efficient early-stage training for tool-using agents, reducing reliance on large synthetic datasets while maintaining exploration diversity. The explicit framing of experience integration as a dynamic process targeting distribution shift and prefix conflicts is a targeted contribution to RL-for-agents literature.

major comments (2)

[Abstract] Abstract: The central claims of '6 performance improvement', 'less than 10 of the synthetic data', and '1.46x gain' in ROI are presented with no experimental details, baseline definitions, task descriptions, statistical tests, number of runs, or ablation results for the branching exploration or mix policy components. These omissions are load-bearing for the paper's primary contribution.
[Method] Method (description of mix policy optimization): The mechanism for 'resolv[ing] optimization conflicts arising from shared prefixes' and 'mitigat[ing] distribution shifts' is described only at a high level with no equations, pseudocode, loss formulation, or stability analysis, leaving the weakest assumption (reliable avoidance of new instabilities without heavy tuning) untested in the provided text.

minor comments (2)

[Abstract] Abstract: The ROI sentence is grammatically incomplete ('efficiency we achieve a 1.46x gain' requires a comma or rephrasing).
[Abstract] Abstract: 'less than 10 of the synthetic data' is missing the percent sign or qualifier ('10%').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped clarify areas for improvement in our presentation of E3-TIR. We respond to each major comment below and indicate the revisions made to the manuscript.

Referee: [Abstract] Abstract: The central claims of '6 performance improvement', 'less than 10 of the synthetic data', and '1.46x gain' in ROI are presented with no experimental details, baseline definitions, task descriptions, statistical tests, number of runs, or ablation results for the branching exploration or mix policy components. These omissions are load-bearing for the paper's primary contribution.

Authors: The abstract is designed as a concise summary of the core claims. Full experimental details are provided in Sections 4 and 5, including task descriptions (ToolBench and API-Bank), baseline definitions (Zero-RL and SFT-then-RL), number of runs (5 seeds), statistical tests, and ablations on branching and mix policy. To address the concern that these details are load-bearing, we have revised the abstract to briefly reference the benchmarks, the use of multiple runs with significance testing, and the key ablation findings. This change improves accessibility without altering the abstract's length substantially.

revision: partial

Referee: [Method] Method (description of mix policy optimization): The mechanism for 'resolv[ing] optimization conflicts arising from shared prefixes' and 'mitigat[ing] distribution shifts' is described only at a high level with no equations, pseudocode, loss formulation, or stability analysis, leaving the weakest assumption (reliable avoidance of new instabilities without heavy tuning) untested in the provided text.

Authors: We agree that the original description of mix policy optimization was high-level. In the revised manuscript, we have added the explicit loss formulation (now Equation 3) that combines weighted terms for Expert Prefixes, Expert Guided, and Self-Exploration experiences to resolve prefix conflicts and mitigate shifts. We have also included pseudocode as Algorithm 1 and a stability analysis in Appendix C, with additional experiments confirming that the approach avoids new instabilities across the tested hyperparameter ranges and requires no more tuning than standard RL baselines.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes E3-TIR as a new warm-up training paradigm that dynamically integrates Expert Prefixes, Expert Guided, and Self-Exploration via branching exploration around anchors plus mix policy optimization. All central claims are framed as empirical experimental outcomes (6 percent performance gain, <10% synthetic data, 1.46x ROI) rather than mathematical derivations, predictions from fitted parameters, or self-referential definitions. No equations, loss formulations, or self-citations appear in a load-bearing role that would reduce the method to its own inputs by construction. The approach is presented as an independent procedural contribution whose validity rests on reported results, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions about policy optimization and the premise that expert anchors plus mixed exploration will produce stable training signals; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: Standard policy-gradient or actor-critic methods can be applied to mixed on-policy and off-policy trajectories without additional instability beyond what the mix policy addresses.
The method invokes RL optimization on the combined experience streams.

pith-pipeline@v0.9.0 · 5549 in / 1241 out tokens · 55601 ms · 2026-05-10T18:03:45.584436+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: By executing diverse branching exploration around expert “anchors” and employing a mix policy optimization mechanism... $J_{\mathrm{Hybrid}}(\theta) = \dots\ \mathrm{CLIP}\big(\rho_{k,t}(\theta), \hat{A}^{\mathrm{exp}}_{k}\big) \cdot \mathbb{I}\big(\hat{A}^{\mathrm{exp}}_{k} > 0\big)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration.
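
The objective fragment quoted in the first link, CLIP(ρ) weighted by an expert advantage and gated by an indicator that the advantage is positive, reads like a PPO-style clipped surrogate applied only where the expert-derived advantage is positive. The sketch below follows that reading and nothing more: the elided part of J_Hybrid is not reproduced, and the clip range of 0.2, the averaging over tokens, and the names `clipped_term` and `expert_term` are assumptions. The fragment indexes the advantage per trajectory (k) and the ratio per token (k, t); here both are flattened to per-token lists for brevity.

```python
from typing import List


def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def expert_term(ratios: List[float], advantages: List[float], eps: float = 0.2) -> float:
    """Token average of CLIP(ratio, A) * 1[A > 0]; non-positive advantages contribute 0."""
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        return 0.0
    total = 0.0
    for r, a in zip(ratios, advantages):
        if a > 0:  # indicator I(A > 0): skip tokens whose advantage is not positive
            total += clipped_term(r, a, eps)
    return total / len(ratios)


# Toy values: importance ratios and advantages for three expert-derived tokens.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, 1.0, -2.0]
print(expert_term(ratios, advantages))
```

Under this reading, expert tokens with a non-positive advantage add nothing to this term, so they cannot pull the policy away from its own better continuations.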

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
cs.CL 2026-05 unverdicted novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework
cs.CL 2026-05 unverdicted novelty 5.0

C-BPO personalizes LLMs via preference-calibrated binary signals and PU learning theory to isolate inter-user differences from shared task knowledge.

Reference graph

Works this paper leans on

35 extracted references · 25 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[4]

Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025b. Research: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470

work page arXiv 2025
[5]

Yifei Chen, Guanting Dong, and Zhicheng Dou. 2025c. Toward effective tool-integrated reasoning via self-evolved preference learning. arXiv preprint arXiv:2509.23285

work page arXiv 2025
[6]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Zheng Ding and Weirui Ye. 2025. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models. arXiv preprint arXiv:2512.08153

work page arXiv 2025
[8]

Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. 2025a. Agentic entropy-balanced policy optimization. arXiv preprint arXiv:2510.14545

work page arXiv 2025
[9]

Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. 2025b. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning. arXiv preprint arXiv:2505.16410

work page arXiv 2025
[10]

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. 2025c. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849

work page arXiv 2025
[11]

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. 2025a. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536

work page internal anchor Pith review arXiv 2025
[12]

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. 2025b. Group-in-group policy optimization for llm agent training. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)

2025
[13]

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, and Xinzhe Juan. 2025. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046

work page internal anchor Pith review arXiv 2025
[14]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)

2021
[15]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 6609--6625

2020
[16]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025a. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. In Proceedings of the Conference on Language Modeling (COLM)

2025
[17]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025b. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516

work page Pith review arXiv 2025
[18]

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025a. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366

work page internal anchor Pith review arXiv 2025
[19]

Xuefeng Li, Haoyang Zou, and Pengfei Liu. 2025b. Torl: Scaling tool-integrated RL. arXiv preprint arXiv:2503.23383

work page arXiv 2025
[20]

Heng Lin and Zhongwen Xu. 2025. Understanding tool-integrated reasoning. arXiv preprint arXiv:2508.19201

work page arXiv 2025
[21]

Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. 2025. Towards a unified view of large language model post-training. arXiv preprint arXiv:2509.04419

work page arXiv 2025
[22]

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics (EMNLP), pages 5687--5711

2023
[23]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. 2025. Agentic reasoning and tool integration for llms via reinforcement learning. arXiv preprint arXiv:2505.01441

work page arXiv 2025
[25]

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592

work page internal anchor Pith review arXiv 2025
[26]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL)

2022
[27]

Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, and Zhenzhe Ying. 2025a. Information gain-based policy optimization: A simple and effective approach for multi-turn LLM agents. arXiv preprint arXiv:2510.14967

work page arXiv 2025
[28]

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. 2025b. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870

work page arXiv 2025
[30]

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, and Linjie Li. 2025d. RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073

work page internal anchor Pith review arXiv 2025
[31]

Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. 2025e. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107

work page arXiv 2025
[32]

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. 2025. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479

work page arXiv 2025
[33]

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. 2025. Learning to reason under off-policy guidance. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)

2025
[34]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

2018
[35]

Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, and Mengdi Wang. 2025. Demystifying reinforcement learning in agentic reasoning. arXiv preprint arXiv:2510.11701

work page arXiv 2025
[36]

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, and Zaibin Zhang. 2025. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Bingxi Zhao, Lin Geng Foo, Ping Hu, Christian Theobalt, Hossein Rahmani, and Jun Liu. 2025. Llm-based agentic reasoning frameworks: A survey from methods to scenarios. arXiv preprint arXiv:2508.17692

work page arXiv 2025