arxiv: 2606.22995 · v1 · pith:GYUUNYSLnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · cs.CL

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

Pith reviewed 2026-06-26 08:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords group-based reinforcement learning · state-transition graph · agentic RL · policy optimization · long-horizon tasks · credit assignment · variance reduction · TD error standardization

The pith

G2PO turns linear agent trajectories into a shared state-transition graph to cut variance in long-horizon value estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard step-level RL still treats each agent trajectory as an isolated line, which inflates variance when the same observation appears in different histories. G2PO instead builds one global graph from all trajectories and aggregates identical observations into shared state nodes. This produces lower-variance value estimates and lets the algorithm standardize TD errors across every edge in the graph rather than locally. The result is credit assignment that tracks absolute task progress instead of myopic per-trajectory signals. On three long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) the abstract reports success-rate improvements of up to 22.2% over GRPO.
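The variance argument can be made concrete under a simplifying assumption that this page does not attribute to the paper: suppose an observation s is visited n times across the sampled trajectories, and the returns G_1, ..., G_n observed from those visits are independent with a common variance σ². The symbols below are illustrative, not the paper's notation.

```latex
\hat{V}(s) = \frac{1}{n}\sum_{i=1}^{n} G_i,
\qquad
\operatorname{Var}\bigl[\hat{V}(s)\bigr] = \frac{\sigma^{2}}{n}.
```

A single-trajectory estimate corresponds to n = 1, so pooling n visits shrinks the variance by a factor of n in this idealized case. If the pooled returns are positively correlated, for instance because trajectories share a prefix, the reduction is smaller.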

Core claim

G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, group-aggregation state-value estimation reduces sampling variance and trajectory-dependent bias. Furthermore, the method redefines actions as edges and applies an edge-centric advantage estimator that globally standardizes TD errors, thereby identifying and prioritizing the transitions that drive absolute task progress.
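A minimal sketch of the aggregation idea follows. It assumes exact string matching on observations (which the simulated rebuttal further down says the manuscript relies on) and, as a stand-in for the paper's unspecified estimator, averages Monte Carlo returns-to-go over all visits to a node. The function names, the toy trajectories, and the choice of `gamma` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A step is (observation_text, reward received after acting from it).
Step = Tuple[str, float]
Trajectory = List[Step]


def returns_to_go(traj: Trajectory, gamma: float) -> List[float]:
    """Discounted return from each step to the end of the trajectory."""
    out = [0.0] * len(traj)
    running = 0.0
    for t in range(len(traj) - 1, -1, -1):
        running = traj[t][1] + gamma * running
        out[t] = running
    return out


def aggregate_state_values(trajs: List[Trajectory], gamma: float = 1.0) -> Dict[str, float]:
    """Average return-to-go over every visit to the same observation string.

    Assumption: identical strings are treated as the same state node.
    """
    samples: Dict[str, List[float]] = defaultdict(list)
    for traj in trajs:
        for (obs, _), g in zip(traj, returns_to_go(traj, gamma)):
            samples[obs].append(g)
    return {obs: sum(gs) / len(gs) for obs, gs in samples.items()}


# Hypothetical toy data: three trajectories that share some observations.
trajs = [
    [("start", 0.0), ("fridge", 0.0), ("microwave", 1.0)],  # succeeds
    [("start", 0.0), ("fridge", 0.0), ("sink", 0.0)],       # fails
    [("start", 0.0), ("sink", 0.0)],                        # fails
]
values = aggregate_state_values(trajs)
```

In this toy run the node "fridge" pools one successful and one failed visit, so its value reflects both trajectories rather than either one alone.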

What carries the argument

Global state-transition graph whose nodes aggregate identical observations and whose edges carry standardized TD-error advantages.

If this is right

  • State-value estimates become independent of any single trajectory once the same observation is seen elsewhere.
  • Credit assignment can favor transitions that advance the global task even when they appear in low-reward local paths.
  • Policy updates prioritize edges with high standardized advantage rather than locally high TD error.
  • The same aggregation step simultaneously lowers variance and removes trajectory-specific bias.
  • Success-rate gains appear on WebShop, ALFWorld, and AppWorld when compared with GRPO.
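The edge-centric idea in the list above can be sketched as follows. This page gives no equation for it, so the sketch assumes the textbook one-step TD error, delta = r + gamma * V(s') - V(s), and a plain z-score over every transition in the graph. The function name, the toy values, and the `eps` stabilizer are hypothetical and should not be read as the authors' definition.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# A transition is (observation, next_observation, reward). In this sketch an
# edge is identified by the (observation, next_observation) pair.
Transition = Tuple[str, str, float]
Edge = Tuple[str, str]


def edge_advantages(
    transitions: List[Transition],
    values: Dict[str, float],
    gamma: float = 1.0,
    eps: float = 1e-8,
) -> Dict[Edge, float]:
    """Z-score TD errors over all transitions, then average per edge."""
    # Unknown successor nodes (e.g. terminal states) are given value 0.
    td = [r + gamma * values.get(nxt, 0.0) - values[obs] for obs, nxt, r in transitions]
    mu, sigma = mean(td), pstdev(td)
    per_edge: Dict[Edge, List[float]] = defaultdict(list)
    for (obs, nxt, _), delta in zip(transitions, td):
        per_edge[(obs, nxt)].append((delta - mu) / (sigma + eps))
    return {edge: mean(zs) for edge, zs in per_edge.items()}


# Hypothetical node values and transitions.
values = {"start": 0.25, "fridge": 0.5, "sink": 0.0, "microwave": 1.0}
transitions = [
    ("start", "fridge", 0.0),
    ("start", "sink", 0.0),
    ("fridge", "microwave", 0.0),
    ("fridge", "sink", 0.0),
]
adv = edge_advantages(transitions, values)
```

Because the mean and standard deviation are computed once over the whole graph, the edge from "fridge" to "microwave" ends up with a larger advantage than the edge from "start" to "fridge", even though each is the better choice within its own local group.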

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-construction step could be applied to any episodic setting where states recur across episodes.
  • If observation matching is noisy, the variance-reduction benefit may shrink faster than the bias cost grows.
  • Edge-centric standardization might combine with existing graph neural network value functions for larger state spaces.

Load-bearing premise

Observations that look identical across separate trajectories can be merged without discarding history-dependent information or creating systematic bias.

What would settle it

Run the same agent on a benchmark where two trajectories reach an observation that is textually identical but requires different future actions; measure whether merging those states lowers final success rate relative to a non-aggregated baseline.
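A toy version of that probe, with fabricated illustrative records rather than benchmark data, shows what merging can hide. Each record carries the action history, the observation text, and the return that followed; the helper name `value_table` and the records themselves are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Hashable, List, Tuple

# Each record is (history, observation, return_to_go). The same observation
# text is reached after two different histories with different outcomes.
Record = Tuple[Tuple[str, ...], str, float]

OBS = "You arrive at microwave 1."
records: List[Record] = [
    (("take bowl", "cool bowl"), OBS, 1.0),
    (("take bowl", "cool bowl"), OBS, 1.0),
    (("take bowl",), OBS, 0.0),
    (("take bowl",), OBS, 0.0),
]


def value_table(recs: List[Record], key_fn: Callable[[Record], Hashable]) -> Dict[Hashable, float]:
    """Mean return per key, where the key decides which visits are pooled."""
    buckets: Dict[Hashable, List[float]] = defaultdict(list)
    for rec in recs:
        buckets[key_fn(rec)].append(rec[2])
    return {k: mean(v) for k, v in buckets.items()}


merged = value_table(records, key_fn=lambda r: r[1])         # observation only
split = value_table(records, key_fn=lambda r: (r[0], r[1]))  # history + observation
```

Here the merged table assigns the observation a single value of 0.5, while the history-aware table separates it into 1.0 and 0.0. Whether such aliasing occurs often enough to hurt success rate on the actual benchmarks is exactly what the proposed experiment would measure.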

Figures

Figures reproduced from arXiv: 2606.22995 by Feng Sun, Furu Wei, Haizhen Huang, MingHui Song, Qi Zhang, Shaohan Huang, Weiwei Deng, Yunan Wang, Zihan Zhang.

Figure 1: (a) compares the training sample units between trajectory-level and step-level training. (b) …
Figure 2: Overview of G2PO. For the sampled trajectories, we first construct a state group graph by …
Figure 3: Ablation results. The y-axis shows success rate (%). In this part, we conduct ablation studies to validate the key components of G2PO. We start with a baseline Episode-level Advantage (A_EP) and incrementally introduce Node-Centric Advantage (A_NC), Group-Aggregation (GA) state-value estimation, and Edge-Centric Advantage (A_EC) to evaluate their individual contributions. We train on Qwen2.5-1.5B-Instruct…
Figure 4: Results for analysis. Figure 4a: results under different values of parameter w; Figure 4b: variance distribution of state-value estimation with step-level RL methods, highlighting the necessity of the group-aggregation mechanism to reduce variance; Figure 4c: the number of inference steps required to finish tasks. G2PO demonstrates substantial advantages over existing methods in inference efficiency. …
Figure 5: Breakdown of time consumption per training step. G2PO introduces only 1 s of additional time overhead when computing advantages. In this section, we demonstrate the efficiency of G2PO by analyzing the time consumption per training step. We use Qwen2.5-1.5B-Instruct as the base model and train it on the ALFWorld benchmark. As illustrated in …
Figure 6: Dynamics of average group size during the training using Qwen2.5-1.5B-Instruct.
[Histogram panels extracted with this caption: x-axis Group Size (bins 1, 2-4, 5-7, 8-10, 11-20, 21-30, 31-40, 41-50, >50); y-axis Proportion (%). Bar labels in extraction order: WebShop 11.9, 10.0, 4.2, 15.5, 13.3, 18.9, 9.2, 6.2, 10.8; ALFWorld 8.1, 17.7, 14.6, 17.1, 16.3, 7.6, 4.4, 3.5, 10.7.]
Figure 8: Prompt template for ALFWorld.
Figure 9: Prompt template for WebShop. Template text as extracted: "You are an expert autonomous agent operating in the WebShop e-commerce environment. Your task is to: {task_description}. Prior to this step, you have already taken {step_count} step(s). Below are the most recent {history_length} observations and the corresponding actions you took: {action_history}. You are now at step {current_step} and your current…"
F Case Study. To demonstrate the superiority of edge-centric advantage (A_EC) over local group comparisons, we provide a case graph constructed during the advantage estimation phase of G2PO training on ALFWorld. As shown in …
Figure 10: Prompt template for AppWorld.
Figure 11: A case study illustrating the superiority of the edge-centric advantage over local group …
[Case-graph content as extracted: state nodes G39, G7, G15, and G6 for the task "Cool some bowl and put it in microwave." Edge labels ×n give transition counts (×3, ×1, ×1, ×3); node values shown are V = 0.00, V = 2.10, V = 6.41, and V = 1.42; annotations read "Global Mean Value Gain: 0.07", "Local Mean Value: 1.54", and "Value Gain = 2.10". State observations: G39 "You move the cup 1 to the microwave 1"; G7 "You arrive at fridge 1. The fridge 1 is open. In it, you see nothing."; G15 "You arrive at microwave 1. The microw…"]
Original abstract

Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL suffers from severe reward sparsity and delay, as feedback is often deferred for dozens of interaction steps. While existing step-level frameworks refine training granularity, their credit assignment remains coarse-grained and still treats agent exploration as isolated, linear trajectories. This oversimplified perspective ignores the inherent graph structure of state transitions, leading to high-variance state-value estimation and myopic, localized credit assignment. To overcome these critical bottlenecks, we propose Group-Graph Policy Optimization (G2PO), a novel group-based RL algorithm tailored for multi-turn agentic tasks. G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, we introduce group-aggregation state-value estimation that reduces sampling variance and trajectory-dependent bias. Furthermore, we redefine agent actions as transitions between state nodes and propose an edge-centric advantage estimation strategy. By globally standardizing Temporal Difference (TD) errors across the entire graph, G2PO explicitly identifies and prioritizes critical transitions that drive absolute task progress. Extensive experiments on representative long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) demonstrate that G2PO substantially outperforms state-of-the-art prompt-based and RL baselines, achieving remarkable success rate improvements of up to 22.2% over GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Group-Graph Policy Optimization (G2PO) for long-horizon agentic RL with LLMs. It claims to convert linear trajectories into a global state-transition graph by aggregating identical observations across trajectories, introduce group-aggregation state-value estimation to reduce sampling variance and trajectory-dependent bias, and propose edge-centric advantage estimation that globally standardizes TD errors to identify critical transitions, yielding up to 22.2% success-rate gains over GRPO on WebShop, ALFWorld, and AppWorld.

Significance. If the state-equivalence assumption holds and the graph-based aggregation and standardization deliver the claimed variance reduction without introducing bias, the work would provide a concrete mechanism for finer-grained, global credit assignment in sparse-reward, multi-turn agentic settings, extending group-based RL beyond isolated trajectories. The multi-benchmark empirical gains would then constitute a practically relevant advance.

major comments (2)
  1. [Abstract] Abstract (and method description): the central variance-reduction and bias-reduction claims rest on aggregating 'identical observations' into shared state nodes, yet no equivalence relation, embedding, history augmentation, or test for state identity is supplied. In the POMDP benchmarks cited (ALFWorld, WebShop), the same textual observation can arise from distinct histories, so the aggregation step risks conflating non-Markovian states; this directly undermines the 'reduces sampling variance and trajectory-dependent bias' claim.
  2. [Abstract] Abstract: the edge-centric advantage is described as 'globally standardizing TD errors across the entire graph,' but no equation, pseudocode, or definition of the standardization operator (e.g., how per-edge TD errors are collected, normalized, or used in the policy gradient) is provided. Without this, the 'prioritizes critical transitions' mechanism cannot be verified or reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and method description): the central variance-reduction and bias-reduction claims rest on aggregating 'identical observations' into shared state nodes, yet no equivalence relation, embedding, history augmentation, or test for state identity is supplied. In the POMDP benchmarks cited (ALFWorld, WebShop), the same textual observation can arise from distinct histories, so the aggregation step risks conflating non-Markovian states; this directly undermines the 'reduces sampling variance and trajectory-dependent bias' claim.

    Authors: We agree this is a substantive gap. The manuscript relies on exact string matching of textual observations for state identity without an explicit equivalence relation or discussion of non-Markovian effects in POMDPs. We will revise the method section to formally define state equivalence via exact observation matching, add a limitations paragraph on potential bias in POMDP settings, and qualify the variance-reduction claim accordingly. revision: yes

  2. Referee: [Abstract] Abstract: the edge-centric advantage is described as 'globally standardizing TD errors across the entire graph,' but no equation, pseudocode, or definition of the standardization operator (e.g., how per-edge TD errors are collected, normalized, or used in the policy gradient) is provided. Without this, the 'prioritizes critical transitions' mechanism cannot be verified or reproduced.

    Authors: We concur that the description is insufficient for verification. Although the full manuscript outlines the approach in Section 3, it lacks the explicit standardization equation and pseudocode. We will add the z-score normalization formula for per-edge TD errors, the collection procedure across the graph, and its integration into the policy gradient, along with pseudocode, to ensure reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: new graph construction and aggregation applied to standard TD errors

Full rationale

The paper proposes G2PO as a novel algorithmic structure that converts trajectories to a state-transition graph, aggregates identical observations for group state-value estimation, and applies edge-centric advantage via global TD standardization. No equations, derivations, or self-citations are shown that reduce these mechanisms to fitted parameters renamed as predictions or to self-definitional identities. The central claims rest on the new graph structure and empirical results on WebShop/ALFWorld/AppWorld rather than any reduction to prior inputs by construction. The Markovian aggregation assumption is a modeling choice open to critique on POMDP grounds but does not constitute circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that observations can be aggregated across trajectories and on the standard RL machinery of TD errors; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Observations can be treated as identical across trajectories for the purpose of state aggregation.
    The method aggregates identical observations to reduce variance and trajectory-dependent bias.

pith-pipeline@v0.9.1-grok · 5833 in / 1304 out tokens · 32815 ms · 2026-06-26T08:42:30.662210+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 24 canonical work pages · 17 internal anchors

  1. [1]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  2. [2]

    Playing text-adventure games with graph-based deep reinforcement learning

    Prithviraj Ammanabrolu and Mark Riedl. Playing text-adventure games with graph-based deep reinforcement learning. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565, 2019

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  4. [4]

    More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

    Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, et al. More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

  5. [5]

    Dreamprm: Domain-reweighted process reward model for multimodal reasoning

    Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, and Pengtao Xie. Dreamprm: Domain-reweighted process reward model for multimodal reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  6. [6]

    Qwen3-Coder-Next Technical Report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

  7. [7]

    Reinforcement learning for long-horizon interactive llm agents.arXiv preprint arXiv:2502.01600, 2025

    Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents.arXiv preprint arXiv:2502.01600, 2025

  8. [8]

    Context-lite multi-turn reinforcement learning for llm agents

    Wentse Chen, Jiayu Chen, Hao Zhu, and Jeff Schneider. Context-lite multi-turn reinforcement learning for llm agents. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025

  9. [9]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  11. [11]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  12. [12]

    Towards efficient online tuning of vlm agents via counterfactual soft reinforcement learning

    Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, and Bo An. Towards efficient online tuning of vlm agents via counterfactual soft reinforcement learning. InInternational Conference on Machine Learning, pages 16884–16903. PMLR, 2025

  13. [13]

    Group-in-group policy optimization for llm agent training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  14. [14]

    From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

  15. [15]

    Soft Adaptive Policy Optimization

    Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347, 2025. 10

  16. [16]

    A new era of intelligence with gemini 3, 2025

    Google. A new era of intelligence with gemini 3, 2025

  17. [17]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  18. [18]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023

  19. [19]

    Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation

    Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jian-Guang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 496–507, 2025

  20. [20]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716, 2024

  21. [21]

    Search-r1: Training llms to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. InSecond Conference on Language Modeling

  22. [22]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  23. [23]

    Process reward model with q-value rankings

    Wendi Li and Yixuan Li. Process reward model with q-value rankings. InThe Thirteenth International Conference on Learning Representations

  24. [24]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

  25. [25]

    Llm collaboration with multi-agent reinforcement learning.arXiv preprint arXiv:2508.04652, 2025

    Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, and Christopher Amato. Llm collaboration with multi-agent reinforcement learning.arXiv preprint arXiv:2508.04652, 2025

  26. [26]

    Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges.arXiv preprint arXiv:2503.21460, 2025

  27. [27]

    Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

    Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K Qiu, and Yuqing Yang. Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

  28. [28]

    Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  29. [29]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  30. [30]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  31. [31]

    Language understanding for text- based games using deep reinforcement learning

    Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text- based games using deep reinforcement learning. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11, 2015

  32. [32]

    Introducing gpt-5.2, 2025

    OpenAI. Introducing gpt-5.2, 2025. 11

  33. [33]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  34. [34]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

  35. [35]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  36. [36]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

  37. [37]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

  38. [38]

    Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  39. [39]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  40. [40]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  41. [41]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  42. [42]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020

  43. [43]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  44. [44]

    Reinforcement learning: An introduction second edition

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction second edition. Adaptive computation and machine learning: The MIT Press, Cambridge, MA and London, 2018

  45. [45]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  46. [46]

    Appworld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  47. [47]

    Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025. 12

  48. [48]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InForty-first International Conference on Machine Learning, 2024

  49. [49]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

  50. [50]

    Stepsearch: Igniting llms search ability via step-wise proximal policy optimization.arXiv preprint arXiv:2505.15107, 2025

    Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization.arXiv preprint arXiv:2505.15107, 2025

  51. [51]

    A brain-inspired agentic architecture to improve planning with llms

    Taylor Webb, Shanka Subhra Mondal, and Ida Momennejad. A brain-inspired agentic architecture to improve planning with llms. Nature Communications, 16(1):8633, 2025

  52. [52]

    Agentic reasoning: Reasoning llms with tools for the deep research

    Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research

  53. [53]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  54. [54]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  55. [55]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  56. [56]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  57. [57]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259, 2025

  58. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  59. [59]

    Douzero: Mastering doudizhu with self-play deep reinforcement learning

    Daochen Zha, Jingru Xie, Wenye Ma, Sheng Zhang, Xiangru Lian, Xia Hu, and Ji Liu. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In International Conference on Machine Learning, pages 12333–12344. PMLR, 2021

  60. [60]

    Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework

    Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, et al. Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework. arXiv preprint arXiv:2510.04206, 2025

  61. [61]

    Language agent tree search unifies reasoning, acting, and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, pages 62138–62160, 2024

  62. [62]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations

  63. [63]

    Wpo: Enhancing rlhf with weighted preference optimization

    Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, and Chenguang Zhu. Wpo: Enhancing rlhf with weighted preference optimization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8328–8340, 2024.

    A Limitations

    One limitation of our paper is that we o...

  64. [64]

    into interactive, text-based environments using TextWorld. The benchmark features six categories of household tasks that require agents to perceive environment states, plan long-horizon trajectories, and generalize their knowledge across diverse indoor scenes. AppWorld. AppWorld serves as a rigorous testing ground for LLM-based agents, providing a control...

  65. [65]

    The email addresses, access tokens and variables (e.g. spotify_password) in the example above were only for demonstration. Obtain the correct information by calling relevant APIs yourself

  66. [66]

    Only generate valid code blocks, i.e., do not put them in ```...``` or add any extra formatting. Any thoughts should be put as code comments

  67. [67]

    You can use the variables from the previous code blocks in the subsequent code blocks

  68. [68]

    Once you believe the task is complete, you MUST call `apis.supervisor.complete_task()` to finalize it. If the task requires an answer, provide it using the answer argument — for example, `apis.supervisor.complete_task(answer=<answer>)`. For tasks that do not require an answer, either omit the argument. The task will not end automatically — it will remain ...