pith. machine review for the scientific record.

arxiv: 2605.14558 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords agentic reinforcement learning · action bottleneck · token reweighting · energy-based modeling · policy gradient · credit assignment · large language models · token-level signals

The pith

Token-level signals concentrate on action tokens in agentic RL, so reweighting gradients toward them outperforms uniform policy gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic reinforcement learning trains language models on trajectories that interleave long reasoning traces with brief environment actions. Standard policy-gradient methods assign credit uniformly to every token, yet the actual signals, measured as correlations between token gradients and reward variance across rollouts, concentrate sharply on the few action tokens. The paper introduces ActFocus, which downweights reasoning tokens and further boosts high-uncertainty action tokens through an energy-based redistribution step. Across four environments and multiple model sizes, this change produces large final-step gains over PPO and GRPO with no added runtime or memory cost. A reader would care because the finding points to a lightweight fix for credit assignment that respects where decisions actually occur in long agent trajectories.

Core claim

The central claim is that token-level training signals, quantified by their correlations with reward variance of different rollouts from the same prompt, concentrate sharply on action tokens rather than reasoning tokens even though action tokens form only a small fraction of the trajectory. This creates an action bottleneck under uniform credit assignment. From an energy-based modeling perspective the work shows that down-weighting reasoning tokens while increasing weights on high-uncertainty action tokens via a simple redistribution mechanism resolves the bottleneck and yields consistent improvements.
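One way to picture the claimed concentration, without reproducing the paper's exact energy-based estimator, is a crude per-token signal proxy compared across token types. In the sketch below, each token is scored by |group-relative advantage| × (1 − token probability) and the score is averaged separately over action and reasoning tokens; the tensor shapes, the mask convention, and the proxy itself are assumptions made for illustration, not the paper's formulation.

    import torch

    def signal_concentration(logprobs, action_mask, rewards):
        # logprobs:    (K, T) per-token log-probabilities for K rollouts of one prompt
        # action_mask: (K, T) bool, True on environment-action tokens
        # rewards:     (K,)   scalar outcome reward per rollout
        adv = rewards - rewards.mean()                      # group-relative advantage
        # |A| * (1 - p): a crude stand-in for per-token gradient mass, larger when the
        # rollout outcome is informative and the token itself was uncertain.
        per_token = adv.abs()[:, None] * (1.0 - logprobs.exp())
        return (per_token[action_mask].mean().item(),       # mean signal on action tokens
                per_token[~action_mask].mean().item())      # mean signal on reasoning tokens

A sharply larger first value than second, even though action tokens are rare, is the pattern the paper labels the Action Bottleneck.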

What carries the argument

ActFocus, a token reweighting scheme that downweights gradients on reasoning tokens and applies energy-based redistribution to increase weights on high-uncertainty action tokens.
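A minimal sketch of what such a reweighted objective could look like, assuming a constant down-weight alpha on reasoning tokens and a softmax over negative log-probability as the uncertainty energy on action tokens; the parameter names alpha and beta, the energy choice, and the per-sequence renormalization are illustrative assumptions, not the paper's exact rule.

    import torch

    def reweighted_pg_loss(logprobs, action_mask, advantages, alpha=0.2, beta=1.0):
        # logprobs:    (B, T) per-token log-probs under the current policy
        # action_mask: (B, T) bool, True on environment-action tokens
        # advantages:  (B,)   per-trajectory advantage (e.g. outcome reward minus a baseline)
        weights = torch.full_like(logprobs, alpha)          # reasoning tokens: constant alpha
        # Action tokens keep an average weight of 1, redistributed toward
        # high-uncertainty actions via a softmax over the energy -log p.
        energy = -logprobs                                  # higher = more uncertain token
        scores = (beta * energy).masked_fill(~action_mask, float("-inf"))
        boost = torch.softmax(scores, dim=-1)               # sums to 1 over each row's action tokens
        n_act = action_mask.sum(dim=-1, keepdim=True).clamp(min=1)
        weights = torch.where(action_mask, boost * n_act, weights).detach()
        # REINFORCE-style surrogate; a PPO/GRPO variant would add clipping and a KL term.
        return -(weights * advantages[:, None] * logprobs).mean()

Setting alpha = 1 and beta = 0 recovers uniform credit assignment, the baseline behaviour the paper argues against.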

If this is right

  • Final-step gains reach up to 65.2 percentage points over PPO and 63.7 over GRPO.
  • Improvements appear consistently across four environments and different model sizes.
  • The method adds no runtime or memory overhead during training.
  • Credit assignment becomes more effective once gradients respect the observed concentration on action tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent architectures might benefit from explicitly marking or isolating action tokens so that training objectives can target them more directly.
  • Energy-based views of token importance could be tested in non-agentic fine-tuning where decision points are also sparse within long sequences.
  • The same signal-concentration pattern may appear when training on other long-horizon tasks that mix planning text with discrete choices.
  • Scaling studies for agentic models could track separate signal strengths for reasoning versus action tokens to predict where bottlenecks will emerge.

Load-bearing premise

Down-weighting reasoning tokens will not degrade the quality of the reasoning chain or introduce instabilities in long-horizon trajectories.

What would settle it

A controlled run in which down-weighting reasoning tokens produces shorter or less accurate reasoning steps and lower final reward would show that the reweighting harms the trajectory structure.
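A minimal sketch of the bookkeeping such a controlled run would need, assuming each arm logs a reasoning-token count and a final reward per trajectory; the field names and the two-arm framing are hypothetical, not the paper's evaluation protocol.

    from statistics import mean

    def compare_arms(uniform_runs, reweighted_runs):
        # Each arm is a list of dicts like {"reasoning_tokens": 412, "reward": 1.0}.
        def summarize(runs):
            return (mean(r["reasoning_tokens"] for r in runs),
                    mean(r["reward"] for r in runs))
        len_u, rew_u = summarize(uniform_runs)
        len_r, rew_r = summarize(reweighted_runs)
        # The load-bearing premise fails if the reweighted arm shows both degraded
        # reasoning (e.g. markedly shorter chains) and a lower final reward.
        return {"uniform":    {"reasoning_len": len_u, "reward": rew_u},
                "reweighted": {"reasoning_len": len_r, "reward": rew_r}}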

Figures

Figures reproduced from arXiv: 2605.14558 by David Wipf, Henry Peng Zou, Junhua Liu, Junyou Zhu, Langzhou He, Philip S. Yu, Qitian Wu, Wei-Chieh Huang, Yue Zhou, Zhengyao Gu.

Figure 1
Figure 1: Action Bottleneck in agentic reinforcement learning. Left: A real response from a trajectory, where reasoning tokens far outnumber action tokens. Middle: Action tokens constitute only 4% of model-generated tokens; our token-level reweighting redirects gradient mass towards them. Right: Training signal concentrates in action spans. Results are from Sokoban 3B.
Figure 2
Figure 2: Detailed energy–reward diagnostic behind …
Figure 3
Figure 3: PPO success rate across training steps on FrozenLake 3B, Sudoku 1.5B, Sokoban 3B, and …
Figure 4
Figure 4: PPO response length across training steps on FrozenLake 3B, Sudoku 1.5B, Sokoban 3B, …
Figure 5
Figure 5: GRPO success rate across training steps on FrozenLake 3B, Sudoku 1.5B, Sokoban 3B, …
Figure 6
Figure 6: GRPO response length across training steps on FrozenLake 3B, Sudoku 1.5B, Sokoban 3B, …
Figure 7
Figure 7: Effect of α on training dynamics.
Figure 8
Figure 8: Ablation studies on the reward-shaping design.
read the original abstract

Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper identifies an 'Action Bottleneck' in agentic RL for LLMs, where token-level training signals (measured via correlation with reward variance across rollouts from the same prompt) concentrate on action tokens rather than reasoning tokens, despite actions comprising a small fraction of trajectories. It proposes ActFocus, a reweighting scheme that downweights reasoning tokens and applies energy-based redistribution to boost uncertain action tokens, reporting consistent outperformance over PPO and GRPO with gains up to 65.2 and 63.7 percentage points across four environments and multiple model sizes, at no added runtime or memory cost.

Significance. If the empirical gains are confirmed with proper controls, this provides a simple, zero-cost mechanism to improve credit assignment in multi-turn agentic training by focusing gradients on environment-facing tokens. The energy-based view of token signals offers a useful diagnostic for RL on LLMs and could influence designs for more efficient long-horizon agents.

major comments (3)
  1. [Experimental Results] Experimental Results section: the abstract and main results report large gains (up to 65.2 pp) but provide no details on number of random seeds, standard deviations, statistical significance tests, or controls for prompt/trajectory length, leaving the central claim of consistent outperformance difficult to evaluate.
  2. [ActFocus Method] ActFocus Method section: the down-weighting of reasoning tokens rests on the untested assumption that this will not degrade reasoning chain quality or coherence; no metrics on chain length, logical consistency, or effects on subsequent actions are reported, which is load-bearing for long-horizon validity.
  3. [Ablation Studies] Ablation Studies: no experiments isolate the reasoning down-weight factor from the energy redistribution strength (both free parameters), so it is unclear which component drives the reported improvements over baselines.
minor comments (1)
  1. [Abstract] The abstract uses the informal phrase 'embarrassingly simple'; this could be rephrased for a formal journal submission.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the abstract and main results report large gains (up to 65.2 pp) but provide no details on number of random seeds, standard deviations, statistical significance tests, or controls for prompt/trajectory length, leaving the central claim of consistent outperformance difficult to evaluate.

    Authors: We agree these statistical details are essential. Experiments used 5 independent random seeds per configuration; we will report means, standard deviations, and paired t-test significance results in the revised Experimental Results section and appendix. All methods were evaluated on identical prompt sets with matched maximum trajectory lengths to control for length effects. revision: yes

  2. Referee: [ActFocus Method] ActFocus Method section: the down-weighting of reasoning tokens rests on the untested assumption that this will not degrade reasoning chain quality or coherence; no metrics on chain length, logical consistency, or effects on subsequent actions are reported, which is load-bearing for long-horizon validity.

    Authors: We acknowledge the assumption requires supporting evidence. While final-task gains imply preserved reasoning utility, we will add quantitative metrics (average reasoning chain length, token entropy as a proxy for coherence) and qualitative trace examples in the appendix of the revision to directly address chain quality and downstream action effects. revision: yes

  3. Referee: [Ablation Studies] Ablation Studies: no experiments isolate the reasoning down-weight factor from the energy redistribution strength (both free parameters), so it is unclear which component drives the reported improvements over baselines.

    Authors: We agree that component isolation would strengthen the claims. We will include new ablation tables in the revision that fix one hyperparameter while varying the other (down-weight factor with energy term disabled, and vice versa) across the four environments to quantify individual contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central result is an empirical observation that token-level training signals (quantified via correlation with reward variance across rollouts from one prompt) concentrate on action tokens. This is presented as a data-driven finding from sampling trajectories rather than a mathematical derivation that reduces to its own inputs by construction. The ActFocus reweighting rule is motivated by this observation but does not redefine the measured correlation or variance quantities in terms of the reweighting itself. No equations or self-citations are shown that would create self-definitional loops, fitted-input predictions, or uniqueness claims imported from prior author work. The energy-based redistribution is described as an additional mechanism, not a tautological re-expression of the input data. The chain is therefore self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method introduces at least two tunable scalars (reasoning-token down-weight factor and energy redistribution strength) whose values are chosen to produce the reported gains; the energy-based view itself rests on the domain assumption that reward variance is a reliable proxy for token importance.

free parameters (2)
  • reasoning down-weight factor
    Scalar that reduces gradient magnitude on reasoning tokens; value chosen to optimize final performance.
  • energy redistribution strength
    Coefficient controlling how much extra weight is given to high-uncertainty action tokens.
axioms (1)
  • domain assumption: Reward variance across rollouts from the same prompt is a valid proxy for token-level training signal strength.
    Invoked when the authors quantify token-level signals by correlation with reward variance.

pith-pipeline@v0.9.0 · 5542 in / 1460 out tokens · 38961 ms · 2026-05-15T01:43:12.935990+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 17 internal anchors

  1. [1]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023. URL https://arxiv.org/abs/2311.18232

  2. [2]

    Dense reward for free in reinforcement learning from human feedback, 2024

    Alex J. Chan, Hao Sun, Samuel Holt, and Mihaela van der Schaar. Dense reward for free in reinforcement learning from human feedback, 2024. URL https://arxiv.org/abs/2402.00782

  3. [3]

    Reinforcement learning for long-horizon interactive llm agents, 2025

    Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents, 2025. URL https://arxiv.org/abs/2502.01600

  4. [4]

    Process reinforcement through implicit rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. URL https://arxiv.org/abs/2502.01456

  6. [6]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

  7. [7]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025. URL https://arxiv.org/abs/2505.10978

  8. [8]

    Your classifier is secretly an energy based model and you should treat it like one, 2020

    Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one, 2020. URL https://arxiv.org/abs/1912.03263

  9. [9]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  10. [10]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  11. [11]

    Vineppo: Refining credit assignment in rl training of llms, 2025

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025. URL https://arxiv.org/abs/2410.01679

  12. [12]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation, 2025. URL https://arxiv.org/abs/2505.06120

  13. [13]

    Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020

    Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020

  14. [14]

    Agentic reinforcement learning with implicit step rewards, 2025

    Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards, 2025. URL https://arxiv.org/abs/2509.19199

  15. [15]

    Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

    Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, and Wentao Zhang. From uniform to heterogeneous: Tailoring policy optimization to every token’s nature, 2025. URL https://arxiv.org/abs/2509.16591

  16. [16]

    Fipo: Eliciting deep reasoning with future-kl influenced policy optimization, 2026

    Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization, 2026. URL https://arxiv.org/abs/2603.19835

  17. [17]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  18. [18]

    A survey of temporal credit assignment in deep reinforcement learning, 2024

    Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Olivier Pietquin, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning, 2024. URL https://arxiv.org/abs/2312.01072

  19. [19]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  20. [20]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  22. [22]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018. URL https://arxiv.org/abs/1506.02438

  23. [23]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  24. [24]

    CARL: Criticality-Aware Agentic Reinforcement Learning

    Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, and Tat-Seng Chua. Carl: Focusing agentic reinforcement learning on critical actions, 2026. URL https://arxiv.org/abs/2512.04949

  25. [25]

    Reinforcement learning: An introduction, 1998

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  26. [26]

    Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy, 2026

    Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang. Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy, 2026. URL https://arxiv.org/abs/2508.04349

  27. [27]

    Ignore the KL penalty! boosting exploration on critical tokens to enhance RL fine-tuning

    Jean Vassoyan, Nathanaël Beau, and Roman Plaud. Ignore the KL penalty! boosting exploration on critical tokens to enhance RL fine-tuning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 6123–6133, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics...

  28. [28]

    A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  29. [29]

    Arlarena: A unified framework for stable agentic reinforcement learning, 2026

    Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, and Wei Wang. Arlarena: A unified framework for stable agentic reinforcement learning, 2026. URL https://arxiv.org/abs/2602.21534

  30. [30]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL https://arxiv.org/abs/...

  31. [31]

    Reinforcing language agents via policy optimization with action decomposition, 2024

    Muning Wen, Ziyu Wan, Weinan Zhang, Jun Wang, and Ying Wen. Reinforcing language agents via policy optimization with action decomposition, 2024. URL https://arxiv.org/abs/2405.15821

  32. [32]

    Llm agents making agent tools, 2025

    Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, and Jakob Nikolas Kather. Llm agents making agent tools, 2025. URL https://arxiv.org/abs/2502.11705

  33. [33]

    Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

    Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, and Yu-Gang Jiang. Agentgym-rl: Training llm agents for long-horizon decision making th...

  34. [34]

    A theory of generative convnet

    Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International conference on machine learning, pages 2635–2644. PMLR, 2016

  35. [35]

    Webshop: Towards scalable real-world web interaction with grounded language agents,

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023. URL https://arxiv.org/abs/2207.01206

  36. [36]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629

  37. [37]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  38. [38]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023
