pith. sign in

arxiv: 2606.12384 · v1 · pith:6DNHLJBGnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI

APPO: Agentic Procedural Policy Optimization

Pith reviewed 2026-06-27 10:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords agentic reinforcement learningpolicy optimizationcredit assignmentbranching scoretool uselarge language model agentsprocedural policy
0
0 comments X

The pith

APPO shifts agentic RL credit assignment from tool-call boundaries to fine-grained decision points in the sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that influential decision points in agentic RL sequences are broadly distributed rather than clustered at tool calls, and that token entropy alone fails to identify those with real downstream impact. It introduces APPO to address this by selecting branching locations via a Branching Score and applying procedure-level advantage scaling for credit distribution. Experiments demonstrate that this yields consistent gains of nearly 4 points over strong baselines across 13 benchmarks while preserving efficient tool use and interpretability.

Core claim

APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering spurious high-entropy positions, and further introduces procedure-level advantage scaling to better distribute credit across branched rollouts.

What carries the argument

The Branching Score, which combines token uncertainty with policy-induced likelihood gains of subsequent continuations to identify fine-grained decision points for branching and credit assignment.

If this is right

  • Improves strong agentic RL baselines by nearly 4 points on 13 benchmarks.
  • Maintains efficient tool-calls during execution.
  • Preserves behavior interpretability of the resulting agents.
  • Moves credit assignment from coarse heuristic units like tool calls to fine-grained sequence positions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to sequential decision problems outside tool-using agents where credit must be assigned over long traces.
  • It suggests that pure entropy signals for exploration could be systematically augmented with likelihood-gain terms in other policy optimization settings.
  • Longer-horizon agent tasks might benefit disproportionately because the method targets distributed rather than boundary-concentrated decisions.

Load-bearing premise

The pilot analysis finding that influential decision points are broadly distributed and that the Branching Score reliably filters spurious high-entropy positions will generalize to the full experimental setting.

What would settle it

Running the 13 benchmarks with APPO's Branching Score and procedure-level scaling disabled or replaced by pure entropy-based branching, and observing no performance gain over the original baselines.

Figures

Figures reproduced from arXiv: 2606.12384 by Guanhua Chen, Pengkun Wang, Shidong Yang, Xiangxiang Chu, Xucong Wang, Yong Wang, Yuxiang Ji, Ziyu Ma.

Figure 1
Figure 1. Figure 1: (a): The token entropy distribution in the tool-integrated rollout (sampled from Tool￾Star’s [9] 54K dataset). (b): Average accuracy of branches generated from each token, shown by bins of the entropy and the APPO’s Branching Score (BS). (c): The pass@k of rollouts resampled via different criteria (“oracle” means to resample from the points with the highest accuracy uncertainty); The performance comparison… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of APPO. The agent first interacts with the environment to generate initial [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Analysis of Pass@1 to Pass@5 of ARPO and APPO on four datasets respectively. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The training dynamics of pure-token branching and APPO’s procedural guided branching [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The visualization of the branch distribu [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: WordCloud of tokens selected by alternative designs of the BS metric. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
read the original abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: \textit{where to branch and how to assign credit after branching}. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose \textbf{Agentic Procedural Policy Optimization (APPO)}, which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Agentic Procedural Policy Optimization (APPO) for agentic RL with LLMs. It reports a pilot analysis indicating that influential decision points are broadly distributed in sequences (rather than concentrated at tool calls) and that token entropy alone is unreliable for identifying them. APPO introduces a Branching Score (token uncertainty combined with policy-induced likelihood gains of continuations) to select branching locations and procedure-level advantage scaling for credit assignment. It claims consistent ~4-point gains over strong baselines on 13 benchmarks while preserving efficient tool use and interpretability.

Significance. If the reported gains are robust and the Branching Score's filtering property generalizes beyond the pilot, the work would advance fine-grained credit assignment in long-horizon agentic settings, offering a more targeted alternative to coarse heuristic units such as tool-call boundaries.

major comments (2)
  1. [Pilot analysis] Pilot analysis: the central empirical claim of consistent ~4-point gains on 13 benchmarks rests on the assumption that the pilot observations (broad distribution of influential decisions and superiority of the Branching Score over entropy) generalize to the evaluation tasks. No evidence is supplied that the distribution or filtering property holds on those benchmarks; without it, gains could arise from procedure-level advantage scaling or other unablated factors rather than the targeted branching mechanism.
  2. [Experiments] Experiments: the abstract states empirical gains but supplies no experimental details, baselines, statistical tests, ablation results, or variance measures. This prevents assessment of whether the data actually supports the central claim of consistent improvement attributable to APPO.
minor comments (1)
  1. The abstract would be clearer if it briefly named the 13 benchmarks or the strong agentic RL baselines used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Pilot analysis: the central empirical claim of consistent ~4-point gains on 13 benchmarks rests on the assumption that the pilot observations (broad distribution of influential decisions and superiority of the Branching Score over entropy) generalize to the evaluation tasks. No evidence is supplied that the distribution or filtering property holds on those benchmarks; without it, gains could arise from procedure-level advantage scaling or other unablated factors rather than the targeted branching mechanism.

    Authors: The pilot analysis was performed to identify general characteristics of influential decision points in agentic trajectories and to motivate the Branching Score design. While the exact pilot statistics are not replicated on every one of the 13 benchmarks, the consistent gains across diverse tasks support the utility of the full APPO procedure. To directly address the concern about attribution, we will add an ablation isolating the Branching Score (versus procedure-level scaling alone) together with a verification of its filtering behavior on a representative subset of the evaluation benchmarks. revision: yes

  2. Referee: Experiments: the abstract states empirical gains but supplies no experimental details, baselines, statistical tests, ablation results, or variance measures. This prevents assessment of whether the data actually supports the central claim of consistent improvement attributable to APPO.

    Authors: The abstract is intentionally concise. Complete experimental details—including the 13 benchmarks, baselines, statistical tests, ablation studies, and variance across random seeds—are reported in Section 4 of the manuscript. We will revise the abstract to include a short clause referencing the evaluation protocol and results section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper motivates APPO from a pilot analysis of decision-point distribution and Branching Score properties, then reports empirical gains on 13 benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements, Branching Score definition, or procedure-level advantage scaling to tautological inputs by construction. The pilot observations function as external motivation rather than a load-bearing fitted quantity renamed as prediction, and the experimental results stand as independent evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; review limited to abstract prevents enumeration.

pith-pipeline@v0.9.1-grok · 5768 in / 945 out tokens · 19929 ms · 2026-06-27T10:18:55.339437+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

111 extracted references · 43 linked inside Pith

  1. [1]

    Agarwal, N

    A. Agarwal, N. Jiang, S. M. Kakade, and W. Sun. Reinforcement learning: Theory and algorithms.CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32:96, 2019

  2. [2]

    S. Cao, D. Li, F. Zhao, S. Yuan, S. R. Hegde, C. Chen, C. Ruan, T. Griggs, S. Liu, E. Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent.arXiv preprint arXiv:2511.16108, 2025

  3. [3]

    A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  4. [4]

    K. Chen, Y . Ren, Y . Liu, X. Hu, H. Tian, T. Xie, F. Liu, H. Zhang, H. Liu, Y . Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations.arXiv preprint arXiv:2506.13651, 2025

  5. [5]

    X. Chu, H. Huang, X. Zhang, F. Wei, and Y . Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. InThe Fourteenth International Conference on Learning Representations, 2026

  6. [6]

    Cobbe, V

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  8. [8]

    G. Dong, L. Bao, Z. Wang, K. Zhao, X. Li, J. Jin, J. Yang, H. Mao, F. Zhang, K. Gai, et al. Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545, 2025

  9. [9]

    G. Dong, Y . Chen, X. Li, J. Jin, H. Qian, Y . Zhu, H. Mao, G. Zhou, Z. Dou, and J.-R. Wen. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

  10. [10]

    G. Dong, H. Mao, K. Ma, L. Bao, Y . Chen, Z. Wang, Z. Chen, J. Du, H. Wang, F. Zhang, et al. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025

  11. [11]

    Dubey, A

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  12. [12]

    Ester, H.-P

    M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, volume 96, pages 226–231, 1996

  13. [13]

    J. Feng, S. Huang, X. Qu, G. Zhang, Y . Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  14. [14]

    L. Feng, Z. Xue, T. Liu, and B. An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  15. [15]

    X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y . Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training.arXiv preprint arXiv:2309.17179, 2023

  16. [16]

    J. Gao, W. Fu, M. Xie, S. Xu, C. He, Z. Mei, B. Zhu, and Y . Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

  17. [17]

    Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving.arXiv preprint arXiv:2309.17452, 2023

  18. [18]

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  19. [19]

    Y . Guo, L. Xu, J. Liu, D. Ye, and S. Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models.arXiv preprint arXiv:2505.23564, 2025

  20. [20]

    D. Hafner. Benchmarking the spectrum of agent capabilities.arXiv preprint arXiv:2109.06780, 2021

  21. [21]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

  22. [22]

    Ho, A.-K

    X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020. 10

  23. [23]

    Z. Hou, Z. Hu, Y . Li, R. Lu, J. Tang, and Y . Dong. Treerl: Llm reinforcement learning with on-policy tree search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025

  24. [24]

    J. Hu, J. K. Liu, H. Xu, and W. Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262, 2025

  25. [25]

    Huang, H

    K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, et al. On the direc- tion of rlvr updates for llm reasoning: Identification and exploitation.arXiv preprint arXiv:2603.22117, 2026

  26. [26]

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  27. [27]

    Hurst, A

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    Jaech, A

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  29. [29]

    Y . Ji, Z. Ma, Y . Wang, G. Chen, X. Chu, and L. Wu. Tree search for llm agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025

  30. [30]

    J. Jin, X. Li, G. Dong, Y . Zhang, Y . Zhu, Y . Zhao, H. Qian, and Z. Dou. Hira: A hierarchical reasoning framework for decoupled planning and execution in deep search.arXiv preprint arXiv:2507.02652, 2025

  31. [31]

    J. Jin, Y . Zhu, Z. Dou, G. Dong, X. Yang, C. Zhang, T. Zhao, Z. Yang, and J.-R. Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. InCompanion Proceedings of the ACM on Web Conference 2025, pages 737–740, 2025

  32. [32]

    Kaufmann, P

    T. Kaufmann, P. Weng, V . Bengs, and E. Hüllermeier. A survey of reinforcement learning from human feedback.arXiv preprint arXiv:2312.14925, 2023

  33. [33]

    X. Lai, Z. Tian, Y . Chen, S. Yang, X. Peng, and J. Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms.arXiv preprint arXiv:2406.18629, 2024

  34. [34]

    H. Lee, S. Phatale, H. Mansoor, K. R. Lu, T. Mesnard, J. Ferret, C. Bishop, E. Hall, V . Carbune, and A. Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023

  35. [35]

    Lewis, E

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  36. [36]

    P. Li, Z. Gao, B. Zhang, Y . Mi, X. Ma, C. Shi, T. Yuan, Y . Wu, Y . Jia, S.-C. Zhu, et al. Iterative tool usage exploration for multimodal agents via step-wise preference tuning.arXiv preprint arXiv:2504.21561, 2025

  37. [37]

    X. Li, G. Dong, J. Jin, Y . Zhang, Y . Zhou, Y . Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search- enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

  38. [38]

    X. Li, W. Jiao, J. Jin, G. Dong, J. Jin, Y . Wang, H. Wang, Y . Zhu, J.-R. Wen, Y . Lu, et al. Deepagent: A general reasoning agent with scalable toolsets.arXiv preprint arXiv:2510.21618, 2025

  39. [39]

    X. Li, J. Jin, G. Dong, H. Qian, Y . Wu, J.-R. Wen, Y . Zhu, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025

  40. [40]

    X. Li, J. Jin, Y . Zhou, Y . Wu, Z. Li, Y . Qi, and Z. Dou. Retrollm: Empowering large language models to retrieve fine-grained evidence within generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16754–16779, 2025

  41. [41]

    X. Li, H. Zou, and P. Liu. Torl: Scaling tool-integrated rl.arXiv preprint arXiv:2503.23383, 2025

  42. [42]

    Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y . Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

  43. [43]

    Lightman, V

    H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023. 11

  44. [44]

    Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  45. [45]

    K. Lu, C. Chen, X. Wang, B. Cui, Y . Liu, and W. Zhang. Pilotrl: Training language model agents via global planning-guided progressive reinforcement learning.arXiv preprint arXiv:2508.00344, 2025

  46. [46]

    Z. Lu, Z. Yao, J. Wu, C. Han, Q. Gu, X. Cai, W. Lu, J. Xiao, Y . Zhuang, and Y . Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026

  47. [47]

    C. Ma, S. Yang, K. Huang, J. Lu, H. Meng, S. Wang, B. Ding, S. V osoughi, G. Wang, and J. Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization.arXiv preprint arXiv:2603.19835, 2026

  48. [48]

    Z. Ma, S. Yang, Y . Ji, X. Wang, Y . Wang, Y . Hu, T. Huang, and X. Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

  49. [49]

    McInnes, J

    L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

  50. [50]

    H. Meng, K. Huang, S. Wei, C. Ma, S. Yang, X. Wang, G. Wang, B. Ding, and J. Zhou. Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms.arXiv preprint arXiv:2603.22446, 2026

  51. [51]

    Mialon, C

    G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  52. [52]

    Muennighoff, Z

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

  53. [53]

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

  54. [54]

    Press, M

    O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  55. [55]

    C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

  56. [56]

    C. Qian, E. C. Acikgoz, H. Wang, X. Chen, A. Sil, D. Hakkani-Tur, G. Tur, and H. Ji. Smart: Self-aware agent for tool overuse mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621, 2025

  57. [57]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  58. [58]

    Rafailov, A

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference opti- mization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  59. [59]

    Rasley, S

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y . He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

  60. [60]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  61. [61]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  62. [62]

    L. Shen, Y . Zhang, C. K. Ling, X. Zhao, and T.-S. Chua. Carl: Critical action focused reinforcement learning for multi-step agent.arXiv preprint arXiv:2512.04949, 2025

  63. [63]

    Sheng, C

    G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y . Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 12

  64. [64]

    Shinn, F

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  65. [65]

    Shridhar, X

    M. Shridhar, X. Yuan, M.-A. Côté, Y . Bisk, A. Trischler, and M. Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  66. [66]

    Snell, J

    C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

  67. [67]

    H. Su, S. Diao, X. Lu, M. Liu, J. Xu, X. Dong, Y . Fu, P. Belcak, H. Ye, H. Yin, et al. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv preprint arXiv:2511.21689, 2025

  68. [68]

    Z. Su, L. Pan, X. Bai, D. Liu, G. Dong, J. Huang, M. Lv, W. Hu, F. Zhang, K. Gai, et al. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization.arXiv preprint arXiv:2508.07629, 2025

  69. [69]

    K. Team, Y . Bai, Y . Bao, Y . Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  70. [70]

    K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  71. [71]

    M. L. Team, B. Li, B. Lei, B. Wang, B. Rong, C. Wang, C. Zhang, C. Gao, C. Zhang, C. Sun, et al. Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

  72. [72]

    Q. Team. Qwq: Reflect deeply on the boundaries of the unknown, 2024

  73. [73]

    Trivedi, N

    H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  74. [74]

    H. Wang, C. Qian, M. Li, J. Qiu, B. Xue, M. Wang, H. Ji, and K.-F. Wong. Toward a theory of agents as tool-use decision-makers.arXiv preprint arXiv:2506.00886, 2025

  75. [75]

    L. Wang, W. Xu, Y . Lan, Z. Hu, Y . Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023

  76. [76]

    P. Wang, S. Xu, J. Li, Y . Luo, D. Li, J. Hao, and M. Zhang.Re 2: Unlocking llm reasoning via reinforcement learning with re-solving.arXiv preprint arXiv:2603.07197, 2026

  77. [77]

    S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

  78. [78]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  79. [79]

    Welleck, A

    S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models.arXiv preprint arXiv:2406.16838, 2024

  80. [80]

    D. Wu, D. S. Sachan, W.-t. Yih, and M. Chen. Procedural knowledge at scale improves reasoning.arXiv preprint arXiv:2604.01348, 2026

Showing first 80 references.