RTMC: Step-Level Credit Assignment via Rollout Trees

Suhang Zheng; Tao Wang; Xiaoxiao Xu

arxiv: 2604.11037 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI

RTMC: Step-Level Credit Assignment via Rollout Trees

Tao Wang , Suhang Zheng , Xiaoxiao Xu This is my paper

Pith reviewed 2026-05-10 15:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learningcredit assignmentrollout treesmulti-step agentsMonte Carlo estimationagentic RLSWE-benchadvantage estimation

0 comments

The pith

RTMC delivers step-level credit assignment in agentic RL by aggregating returns across shared states in group rollouts without a learned critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-step agent tasks in reinforcement learning need precise credit for each action, but long trajectories and sparse rewards make this hard. Standard critic-free methods like GRPO spread the same advantage across an entire rollout, while learned critics add overhead and can fail when rewards are infrequent. The paper notes that multiple rollouts for one problem usually pass through the same early states before branching into different paths. By treating these paths as branches of an implicit tree and matching the shared states, RTMC computes distinct Q-values and advantages for each step directly from the actual returns collected. The result is finer credit assignment that improves pass rates on difficult benchmarks while avoiding extra neural networks.

Core claim

We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.

What carries the argument

Rollout-Tree Monte Carlo (RTMC) advantage estimation, which builds an implicit tree from overlapping states visited by group rollouts and uses compressed state-action signatures to match states and aggregate observed returns into per-step Q-values and advantages.

If this is right

It supplies per-step advantages in critic-free settings by exploiting natural state overlap in group sampling.
It produces a 3.2 percentage point lift in pass@1 on SWE-bench Verified relative to GRPO.
It removes the training cost and fragility of value networks when rewards are sparse.
It renders cross-rollout state comparison practical through a lightweight signature compression step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlap idea could be applied inside explicit tree-search planners to refine their value backups using only sampled leaves.
Environments that generate high state revisitation under group sampling may see larger variance reductions than those with mostly unique paths.
RTMC could be combined with occasional light critic training to handle the few states that appear in only one rollout.

Load-bearing premise

Group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points.

What would settle it

Run the same agent on a task suite where independent rollouts rarely revisit the same states after the start; if RTMC then shows no gain over GRPO or produces unstable advantages, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2604.11037 by Suhang Zheng, Tao Wang, Xiaoxiao Xu.

**Figure 2.** Figure 2: Training episode resolve rate on the R2E dataset. All methods start from the same pretrained checkpoint [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Rollout-Tree visualization extracted from training-time rollouts on [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Step-level advantage for each of the 8 rollouts (4 success, 4 failure) in [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Quadrant analysis of step-level advantage agreement between GRPO+Step and RTMC on [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

read the original abstract

Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages--without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RTMC tries to get step-level advantages from matching states across group rollouts with signatures, but the evidence is thin and the matching step looks fragile.

read the letter

The paper's main contribution is a critic-free method for step-level credit assignment. It notices that group rollouts for the same task often share intermediate states and form an implicit tree. RTMC then aggregates returns only from rollouts that hit the same state (via a signature) to estimate per-step Q-values and advantages, instead of giving every step the same score like GRPO does. On SWE-bench Verified it reports a 3.2 point pass@1 gain over GRPO. That is the punchline: a simple aggregation trick that avoids training a value network.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Rollout-Tree Monte Carlo (RTMC) advantage estimation for multi-step agentic RL. It observes that group rollouts for the same task often share intermediate states and thus form implicit trees; RTMC aggregates Monte-Carlo returns only across rollouts that match at a given state (via a state-action signature compression scheme) to obtain per-step Q-values and advantages, without training a critic. The abstract reports that this yields a 3.2 percentage-point gain in pass@1 on SWE-bench Verified relative to GRPO.

Significance. If the state-matching step is shown to be reliable, RTMC would supply a simple, critic-free route to step-level credit assignment that exploits the natural tree structure already present in group rollouts. This could improve sample efficiency for sparse-reward agentic tasks such as code editing without the overhead or fragility of learned value networks. The approach is parameter-free once the signature function is fixed and directly extends existing group-RL baselines.

major comments (2)

[Abstract and §4] The central empirical claim (3.2 pp pass@1 improvement over GRPO on SWE-bench Verified) is presented in the abstract and §4 without error bars, number of independent runs, or ablation studies that isolate the contribution of the rollout-tree aggregation from other implementation details. Because the gain is attributed specifically to finer credit assignment, these controls are required to rule out confounding factors.
[§3.2] §3.2 (state-action signature system): the method relies on the signature compression to decide which rollouts share an identical state for Q-value aggregation. No quantitative verification of matching accuracy (collision rate, false-positive rate, or manual inspection on SWE-bench states that include file contents and execution traces) is supplied. False matches would directly bias the Monte-Carlo estimates that replace GRPO’s uniform advantage, undermining the claim that the observed gain stems from improved credit assignment.

minor comments (2)

[§3] Notation for the signature function and the exact aggregation formula for the per-step advantage should be stated explicitly in a single equation block rather than distributed across prose.
[§3.2] The manuscript should clarify whether the signature system is deterministic or involves any learned embedding; if the latter, the training procedure and any additional parameters must be described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each of the major comments below and describe the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §4] The central empirical claim (3.2 pp pass@1 improvement over GRPO on SWE-bench Verified) is presented in the abstract and §4 without error bars, number of independent runs, or ablation studies that isolate the contribution of the rollout-tree aggregation from other implementation details. Because the gain is attributed specifically to finer credit assignment, these controls are required to rule out confounding factors.

Authors: We concur that error bars, run counts, and isolating ablations are necessary to support the attribution of the performance gain to the step-level credit assignment. The 3.2 percentage-point improvement reported in the abstract and Section 4 was measured in a single training run. For the revised manuscript, we will perform three additional independent runs and report the mean pass@1 with standard error. We will also add an ablation in Section 4 that applies uniform advantages within each group (as in GRPO) while retaining the RTMC implementation details otherwise, thereby isolating the effect of the rollout-tree aggregation. revision: yes
Referee: [§3.2] §3.2 (state-action signature system): the method relies on the signature compression to decide which rollouts share an identical state for Q-value aggregation. No quantitative verification of matching accuracy (collision rate, false-positive rate, or manual inspection on SWE-bench states that include file contents and execution traces) is supplied. False matches would directly bias the Monte-Carlo estimates that replace GRPO’s uniform advantage, undermining the claim that the observed gain stems from improved credit assignment.

Authors: We agree that empirical validation of the signature matching accuracy is essential to rule out bias in the aggregated returns. The state-action signature is formed by hashing file contents at each step, summarizing execution traces, and encoding actions to produce compact keys. In the revised manuscript, we will augment Section 3.2 with a verification subsection. This will include the estimated collision rate obtained by comparing signatures to exact state equality on a random sample of 1,000 states drawn from SWE-bench rollouts, the false-positive rate for matches, and results from manual inspection of 30 matched state pairs to confirm they correspond to identical file states and traces. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RTMC derivation chain

full rationale

The paper defines RTMC as a direct Monte Carlo aggregation of returns over groups of rollouts that share states (identified via the signature compression system). This is a computational procedure applied to the observed rollout data and does not reduce any claimed prediction or Q-value to a fitted parameter or self-referential definition. No equations are presented that equate the output advantage to an input by construction, and the abstract contains no self-citations or uniqueness theorems. The performance improvement on SWE-bench is reported as an empirical outcome rather than a mathematical necessity derived from the method itself. The state-signature step is a practical engineering choice for tractability, not a load-bearing assumption that collapses the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Information limited to abstract; ledger populated from stated observation and introduced component.

axioms (1)

domain assumption Group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points.
This observation is presented as the foundation for building the rollout tree and aggregating statistics.

invented entities (1)

state-action signature system no independent evidence
purpose: Compresses raw interaction histories into compact, comparable representations for cross-rollout state matching.
Required to make tree construction and advantage aggregation tractable.

pith-pipeline@v0.9.0 · 5444 in / 1150 out tokens · 54667 ms · 2026-05-10T15:35:44.010232+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2017

work page 2017
[2]

DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

DeepSeek-AI . DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

work page 2025
[3]

Group-in-group policy optimization for LLM agent training, 2025

Wentao Feng, Xiang Li, Pengfei Liu, and Weizhi Wang. Group-in-group policy optimization for LLM agent training, 2025

work page 2025
[4]

R2E-Gym : Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2E-Gym : Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025

work page 2025
[5]

Estimation with quadratic loss

William James and Charles Stein. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1: 0 361--379, 1961

work page 1961
[6]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE -bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

work page 2024
[7]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 267--274, 2002

work page 2002
[8]

VinePPO : Refining credit assignment in RL training of LLMs , 2025

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux . VinePPO : Refining credit assignment in RL training of LLMs , 2025

work page 2025
[9]

Kimi k1.5: Scaling reinforcement learning with LLMs , 2025

Kimi Team . Kimi k1.5: Scaling reinforcement learning with LLMs , 2025

work page 2025
[10]

Agentic reinforcement learning with implicit step rewards, 2025

Xiaoqian Liu, Xiangyuan Zhao, Ang Li, Yingce Li, Hao Liu, and Di He. Agentic reinforcement learning with implicit step rewards, 2025

work page 2025
[11]

MiniMax-01 : Scaling foundation models with lightning attention, 2025

MiniMax . MiniMax-01 : Scaling foundation models with lightning attention, 2025

work page 2025
[12]

Ng, Daishi Harada, and Stuart Russell

Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999

work page 1999
[13]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022
[14]

Qwen3 technical report, 2025

Qwen Team . Qwen3 technical report, 2025

work page 2025
[15]

High-dimensional continuous control using generalized advantage estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016

work page 2016
[16]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017
[17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

work page 2024
[18]

R ^3 L : Reflect-then-retry reinforcement learning, 2026

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, and Yaliang Li. R ^3 L : Reflect-then-retry reinforcement learning, 2026

work page 2026
[19]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018
[20]

Sutton, David A

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), 1999

work page 1999
[21]

Exploiting tree structure for credit assignment in RL training of LLMs , 2025

Hieu Tran, Zonghai Yao, and Hong Yu. Exploiting tree structure for credit assignment in RL training of LLMs , 2025

work page 2025
[22]

A survey on large language model based autonomous agents, 2023

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents, 2023

work page 2023
[23]

A practitioner's guide to multi-turn agentic reinforcement learning, 2025

Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner's guide to multi-turn agentic reinforcement learning, 2025

work page 2025
[24]

ROLL : Reinforcement learning optimization for large-scale learning, 2025 a

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Si...

work page 2025
[25]

Let it flow: Agentic crafting on rock and roll, building the ROME model within an open agentic learning ecosystem, 2025 b

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qip...

work page 2025
[26]

CLEANER : Self-purified trajectories boost agentic RL , 2026

Tianshi Xu, Yuteng Chen, and Meng Li. CLEANER : Self-purified trajectories boost agentic RL , 2026

work page 2026
[27]

Jimenez, Alexander Wettig, Kilian Liber, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Liber, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering, 2024

work page 2024
[28]

ReAct : Synergizing reasoning and acting in language models, 2022

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct : Synergizing reasoning and acting in language models, 2022

work page 2022
[29]

The landscape of agentic reinforcement learning for LLMs : A survey, 2025 a

Guibin Zhang, Hejia Geng, Xiaohan Yu, Zhenfei Yin, Zaibin Zhang, Hang Li, Yongkang Li, and Lichao Sun. The landscape of agentic reinforcement learning for LLMs : A survey, 2025 a

work page 2025
[30]

AgentRL : Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025 b

Hanchen Zhang, Peilin Zhong, Qi Zhu, Hao Dong, and Mingyuan Chen. AgentRL : Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025 b

work page 2025
[31]

CriticSearch : Fine-grained credit assignment via retrospective critic, 2025 c

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao. CriticSearch : Fine-grained credit assignment via retrospective critic, 2025 c

work page 2025
[32]

AT ^2 PO : Agentic turn-based policy optimization via tree search, 2026

Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, and Jie Jiang. AT ^2 PO : Agentic turn-based policy optimization via tree search, 2026

work page 2026

[1] [1]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2017

work page 2017

[2] [2]

DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

DeepSeek-AI . DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

work page 2025

[3] [3]

Group-in-group policy optimization for LLM agent training, 2025

Wentao Feng, Xiang Li, Pengfei Liu, and Weizhi Wang. Group-in-group policy optimization for LLM agent training, 2025

work page 2025

[4] [4]

R2E-Gym : Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2E-Gym : Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025

work page 2025

[5] [5]

Estimation with quadratic loss

William James and Charles Stein. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1: 0 361--379, 1961

work page 1961

[6] [6]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE -bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

work page 2024

[7] [7]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 267--274, 2002

work page 2002

[8] [8]

VinePPO : Refining credit assignment in RL training of LLMs , 2025

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux . VinePPO : Refining credit assignment in RL training of LLMs , 2025

work page 2025

[9] [9]

Kimi k1.5: Scaling reinforcement learning with LLMs , 2025

Kimi Team . Kimi k1.5: Scaling reinforcement learning with LLMs , 2025

work page 2025

[10] [10]

Agentic reinforcement learning with implicit step rewards, 2025

Xiaoqian Liu, Xiangyuan Zhao, Ang Li, Yingce Li, Hao Liu, and Di He. Agentic reinforcement learning with implicit step rewards, 2025

work page 2025

[11] [11]

MiniMax-01 : Scaling foundation models with lightning attention, 2025

MiniMax . MiniMax-01 : Scaling foundation models with lightning attention, 2025

work page 2025

[12] [12]

Ng, Daishi Harada, and Stuart Russell

Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999

work page 1999

[13] [13]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022

[14] [14]

Qwen3 technical report, 2025

Qwen Team . Qwen3 technical report, 2025

work page 2025

[15] [15]

High-dimensional continuous control using generalized advantage estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016

work page 2016

[16] [16]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017

[17] [17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

work page 2024

[18] [18]

R ^3 L : Reflect-then-retry reinforcement learning, 2026

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, and Yaliang Li. R ^3 L : Reflect-then-retry reinforcement learning, 2026

work page 2026

[19] [19]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018

[20] [20]

Sutton, David A

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), 1999

work page 1999

[21] [21]

Exploiting tree structure for credit assignment in RL training of LLMs , 2025

Hieu Tran, Zonghai Yao, and Hong Yu. Exploiting tree structure for credit assignment in RL training of LLMs , 2025

work page 2025

[22] [22]

A survey on large language model based autonomous agents, 2023

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents, 2023

work page 2023

[23] [23]

A practitioner's guide to multi-turn agentic reinforcement learning, 2025

Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner's guide to multi-turn agentic reinforcement learning, 2025

work page 2025

[24] [24]

ROLL : Reinforcement learning optimization for large-scale learning, 2025 a

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Si...

work page 2025

[25] [25]

Let it flow: Agentic crafting on rock and roll, building the ROME model within an open agentic learning ecosystem, 2025 b

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qip...

work page 2025

[26] [26]

CLEANER : Self-purified trajectories boost agentic RL , 2026

Tianshi Xu, Yuteng Chen, and Meng Li. CLEANER : Self-purified trajectories boost agentic RL , 2026

work page 2026

[27] [27]

Jimenez, Alexander Wettig, Kilian Liber, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Liber, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering, 2024

work page 2024

[28] [28]

ReAct : Synergizing reasoning and acting in language models, 2022

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct : Synergizing reasoning and acting in language models, 2022

work page 2022

[29] [29]

The landscape of agentic reinforcement learning for LLMs : A survey, 2025 a

Guibin Zhang, Hejia Geng, Xiaohan Yu, Zhenfei Yin, Zaibin Zhang, Hang Li, Yongkang Li, and Lichao Sun. The landscape of agentic reinforcement learning for LLMs : A survey, 2025 a

work page 2025

[30] [30]

AgentRL : Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025 b

Hanchen Zhang, Peilin Zhong, Qi Zhu, Hao Dong, and Mingyuan Chen. AgentRL : Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025 b

work page 2025

[31] [31]

CriticSearch : Fine-grained credit assignment via retrospective critic, 2025 c

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao. CriticSearch : Fine-grained credit assignment via retrospective critic, 2025 c

work page 2025

[32] [32]

AT ^2 PO : Agentic turn-based policy optimization via tree search, 2026

Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, and Jie Jiang. AT ^2 PO : Agentic turn-based policy optimization via tree search, 2026

work page 2026