APPO: Agentic Procedural Policy Optimization

Guanhua Chen; Pengkun Wang; Shidong Yang; Xiangxiang Chu; Xucong Wang; Yong Wang; Yuxiang Ji; Ziyu Ma

arxiv: 2606.12384 · v1 · pith:6DNHLJBGnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu

Pith reviewed 2026-06-27 10:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords agentic reinforcement learning · policy optimization · credit assignment · branching score · tool use · large language model agents · procedural policy

The pith

APPO shifts agentic RL credit assignment from tool-call boundaries to fine-grained decision points in the sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that influential decision points in agentic RL sequences are broadly distributed rather than clustered at tool calls, and that token entropy alone fails to identify those with real downstream impact. It introduces APPO to address this by selecting branching locations via a Branching Score and applying procedure-level advantage scaling for credit distribution. Experiments demonstrate that this yields consistent gains of nearly 4 points over strong baselines across 13 benchmarks while preserving efficient tool use and interpretability.

Core claim

APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering spurious high-entropy positions, and further introduces procedure-level advantage scaling to better distribute credit across branched rollouts.

What carries the argument

The Branching Score, which combines token uncertainty with policy-induced likelihood gains of subsequent continuations to identify fine-grained decision points for branching and credit assignment.
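A minimal sketch of how such a score could be computed. This page does not give the paper's formula, so everything here is an assumption for illustration: `token_entropy` stands in for token uncertainty, the likelihood gain is taken as the continuation's log-likelihood under the current policy minus that under a reference policy, and the two are combined by a simple product. The function names and the product form are hypothetical, not APPO's definition.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_score(probs, cont_logp_policy, cont_logp_reference):
    """Hypothetical score: entropy weighted by the positive likelihood gain.

    cont_logp_policy / cont_logp_reference are assumed to be log-likelihoods
    of the same continuation under the current and a reference policy.
    """
    gain = max(0.0, cont_logp_policy - cont_logp_reference)
    return token_entropy(probs) * gain


# A uniform 4-way choice with a gain of 1 nat scores log(4).
print(branching_score([0.25] * 4, -1.0, -2.0))
```

Under this reading, a position with high entropy but no likelihood gain scores zero, which is the filtering behavior the paper attributes to the Branching Score.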

If this is right

Improves strong agentic RL baselines by nearly 4 points on 13 benchmarks.
Keeps tool calls efficient during execution.
Preserves behavior interpretability of the resulting agents.
Moves credit assignment from coarse heuristic units like tool calls to fine-grained sequence positions.
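To make the last point concrete, here is one possible reading of credit assignment below the rollout level. Group-normalized advantages (reward minus group mean, divided by group standard deviation) are a common choice in critic-free policy optimization. The per-segment weights and the function `scale_by_segment` are hypothetical and are not taken from the paper, whose procedure-level scaling rule is not described on this page.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rollout rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scale_by_segment(advantage, segment_ids, segment_weights):
    """Broadcast one rollout-level advantage to its tokens.

    segment_ids[i] names the segment (for example, shared prefix vs.
    branched suffix) that token i belongs to; segment_weights is an
    assumed per-segment multiplier.
    """
    return [advantage * segment_weights[s] for s in segment_ids]


# Tokens in segment 0 receive half the credit of tokens in segment 1.
print(scale_by_segment(2.0, [0, 0, 1], {0: 0.5, 1: 1.0}))
```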

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may extend to sequential decision problems outside tool-using agents where credit must be assigned over long traces.
It suggests that pure entropy signals for exploration could be systematically augmented with likelihood-gain terms in other policy optimization settings.
Longer-horizon agent tasks might benefit disproportionately because the method targets distributed rather than boundary-concentrated decisions.

Load-bearing premise

The pilot analysis finding that influential decision points are broadly distributed and that the Branching Score reliably filters spurious high-entropy positions will generalize to the full experimental setting.

What would settle it

Running the 13 benchmarks with APPO's Branching Score and procedure-level scaling disabled or replaced by pure entropy-based branching, and observing no performance gain over the original baselines.
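The comparison above can be pictured with a toy selection step. The sketch picks the top-k branch positions once by entropy alone and once by a product of entropy and likelihood gain. The product form, the helper `top_k_positions`, and all numbers are invented for illustration and say nothing about the paper's actual results.

```python
def top_k_positions(values, k):
    """Indices of the k largest values, ties broken by earlier position."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    return sorted(order[:k])


# Toy per-token signals, invented for illustration only.
entropy = [0.1, 1.2, 0.3, 1.1, 0.9]
gain = [0.0, 0.0, 0.5, 0.8, 0.6]
score = [h * g for h, g in zip(entropy, gain)]

entropy_branches = top_k_positions(entropy, 2)
score_branches = top_k_positions(score, 2)

print("entropy-only:", entropy_branches)
print("entropy x gain:", score_branches)
```

In this toy case the highest-entropy token (position 1) has zero gain, so the combined criterion drops it. The ablation would test whether that kind of difference in selected positions translates into a benchmark difference.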

Figures

Figures reproduced from arXiv: 2606.12384 by Guanhua Chen, Pengkun Wang, Shidong Yang, Xiangxiang Chu, Xucong Wang, Yong Wang, Yuxiang Ji, Ziyu Ma.

**Figure 1.** (a) The token entropy distribution in the tool-integrated rollout (sampled from ToolStar’s [9] 54K dataset). (b) Average accuracy of branches generated from each token, shown by bins of the entropy and APPO’s Branching Score (BS). (c) The pass@k of rollouts resampled via different criteria (“oracle” means to resample from the points with the highest accuracy uncertainty). The performance comparison…

**Figure 2.** Overview of APPO. The agent first interacts with the environment to generate initial…

**Figure 3.** Analysis of Pass@1 to Pass@5 of ARPO and APPO on four datasets.

**Figure 4.** The training dynamics of pure-token branching and APPO’s procedural guided branching

**Figure 5.** The visualization of the branch distribu…

**Figure 7.** WordCloud of tokens selected by alternative designs of the BS metric.
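Figures 1 and 3 report pass@k. A widely used unbiased estimator, given n sampled rollouts of which c are correct, is 1 - C(n-c, k) / C(n, k). Whether the paper uses this estimator or a direct empirical count is not stated here, so treat the sketch below as an assumption about the metric rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k rollouts drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Five rollouts, one correct: pass@1 through pass@5.
for k in range(1, 6):
    print(k, pass_at_k(5, 1, k))
```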

Original abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

APPO's Branching Score and procedure-level advantage scaling target a real credit assignment issue in agentic RL, but the abstract supplies no experimental details or ablations to support the claimed gains.

APPO proposes a Branching Score for deciding where to explore in multi-turn LLM agent RL, mixing token uncertainty with how the policy affects future likelihoods, along with procedure-level advantage scaling.

The pilot analysis is the starting point, showing that key decisions are spread out rather than clustered at tool calls and that entropy by itself does not pick the right spots.

The paper does a good job of motivating why coarse credit assignment is a bottleneck and why a finer approach could help. The combination in the Branching Score is a concrete step beyond just using entropy.

The soft spot is the complete absence of experimental details in the abstract. We get the headline result of nearly 4-point gains on 13 benchmarks, but nothing on baselines, ablations, or whether the pilot findings actually held on those tasks. If the distribution of influential points differs on the test set, the gains might not come from the new score at all.

This is the same weakness flagged under the load-bearing premise above. The central claim depends on the pilot generalizing, but the abstract supplies no check on that.

The approach looks like honest engagement with the problem of credit in agentic settings. No obvious circularity or fitting issues.

This paper is for the small group working on RL fine-tuning of tool-using LLM agents. Someone already running similar experiments could get value from the specific mechanisms if the full paper shows the ablations.

It deserves a serious referee because the problem is recognized and the proposed changes are specific. The editor should ask for the full experimental section and tests of the pilot assumptions.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Agentic Procedural Policy Optimization (APPO) for agentic RL with LLMs. It reports a pilot analysis indicating that influential decision points are broadly distributed in sequences (rather than concentrated at tool calls) and that token entropy alone is unreliable for identifying them. APPO introduces a Branching Score (token uncertainty combined with policy-induced likelihood gains of continuations) to select branching locations and procedure-level advantage scaling for credit assignment. It claims consistent ~4-point gains over strong baselines on 13 benchmarks while preserving efficient tool use and interpretability.

Significance. If the reported gains are robust and the Branching Score's filtering property generalizes beyond the pilot, the work would advance fine-grained credit assignment in long-horizon agentic settings, offering a more targeted alternative to coarse heuristic units such as tool-call boundaries.

major comments (2)

[Pilot analysis] The central empirical claim of consistent ~4-point gains on 13 benchmarks rests on the assumption that the pilot observations (broad distribution of influential decisions and superiority of the Branching Score over entropy) generalize to the evaluation tasks. No evidence is supplied that the distribution or filtering property holds on those benchmarks; without it, gains could arise from procedure-level advantage scaling or other unablated factors rather than the targeted branching mechanism.
[Experiments] The abstract states empirical gains but supplies no experimental details, baselines, statistical tests, ablation results, or variance measures. This prevents assessment of whether the data actually supports the central claim of consistent improvement attributable to APPO.
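One way the requested variance measure could be reported is a bootstrap confidence interval over per-benchmark gains. The sketch below is a generic percentile bootstrap, with the function name `bootstrap_mean_ci` and its defaults chosen here for illustration; it is not a procedure taken from the manuscript.

```python
import random


def bootstrap_mean_ci(gains, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-benchmark gains.

    gains[i] is assumed to be (method score - baseline score) on benchmark i.
    """
    rng = random.Random(seed)
    n = len(gains)
    # Resample benchmarks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(gains, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder gains, not results from the paper.
print(bootstrap_mean_ci([1.0, 5.0, 3.0, 4.0]))
```

An interval that excludes zero would support the word "consistent"; an interval that straddles zero would suggest the average gain is driven by a few benchmarks.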

minor comments (1)

The abstract would be clearer if it briefly named the 13 benchmarks or the strong agentic RL baselines used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions to strengthen the manuscript.

Referee: Pilot analysis: the central empirical claim of consistent ~4-point gains on 13 benchmarks rests on the assumption that the pilot observations (broad distribution of influential decisions and superiority of the Branching Score over entropy) generalize to the evaluation tasks. No evidence is supplied that the distribution or filtering property holds on those benchmarks; without it, gains could arise from procedure-level advantage scaling or other unablated factors rather than the targeted branching mechanism.

Authors: The pilot analysis was performed to identify general characteristics of influential decision points in agentic trajectories and to motivate the Branching Score design. While the exact pilot statistics are not replicated on every one of the 13 benchmarks, the consistent gains across diverse tasks support the utility of the full APPO procedure. To directly address the concern about attribution, we will add an ablation isolating the Branching Score (versus procedure-level scaling alone) together with a verification of its filtering behavior on a representative subset of the evaluation benchmarks.
Revision: yes

Referee: Experiments: the abstract states empirical gains but supplies no experimental details, baselines, statistical tests, ablation results, or variance measures. This prevents assessment of whether the data actually supports the central claim of consistent improvement attributable to APPO.

Authors: The abstract is intentionally concise. Complete experimental details (including the 13 benchmarks, baselines, statistical tests, ablation studies, and variance across random seeds) are reported in Section 4 of the manuscript. We will revise the abstract to include a short clause referencing the evaluation protocol and results section.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper motivates APPO from a pilot analysis of decision-point distribution and Branching Score properties, then reports empirical gains on 13 benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements, Branching Score definition, or procedure-level advantage scaling to tautological inputs by construction. The pilot observations function as external motivation rather than a load-bearing fitted quantity renamed as prediction, and the experimental results stand as independent evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; because this review is limited to the abstract, none can be enumerated.

pith-pipeline@v0.9.1-grok · 5768 in / 945 out tokens · 19929 ms · 2026-06-27T10:18:55.339437+00:00 · methodology

Reference graph

Works this paper leans on

111 extracted references · 43 linked inside Pith

[1] A. Agarwal, N. Jiang, S. M. Kakade, and W. Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32:96, 2019.
[2] S. Cao, D. Li, F. Zhao, S. Yuan, S. R. Hegde, C. Chen, C. Ruan, T. Griggs, S. Liu, E. Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent. arXiv preprint arXiv:2511.16108, 2025.
[3] A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025.
[4] K. Chen, Y. Ren, Y. Liu, X. Hu, H. Tian, T. Xie, F. Liu, H. Zhang, H. Liu, Y. Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025.
[5] X. Chu, H. Huang, X. Zhang, F. Wei, and Y. Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. In The Fourteenth International Conference on Learning Representations, 2026.
[6] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[7] T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[8] G. Dong, L. Bao, Z. Wang, K. Zhao, X. Li, J. Jin, J. Yang, H. Mao, F. Zhang, K. Gai, et al. Agentic entropy-balanced policy optimization. arXiv preprint arXiv:2510.14545, 2025.
[9] G. Dong, Y. Chen, X. Li, J. Jin, H. Qian, Y. Zhu, H. Mao, G. Zhou, Z. Dou, and J.-R. Wen. Tool-star: Empowering llm-brained multi-tool reasoner via reinforcement learning. arXiv preprint arXiv:2505.16410, 2025.
[10] G. Dong, H. Mao, K. Ma, L. Bao, Y. Chen, Z. Wang, Z. Chen, J. Du, H. Wang, F. Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025.
[11] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.
[12] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[13] J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536, 2025.
[14] L. Feng, Z. Xue, T. Liu, and B. An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025.
[15] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
[16] J. Gao, W. Fu, M. Xie, S. Xu, C. He, Z. Mei, B. Zhu, and Y. Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025.
[17] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
[18] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[19] Y. Guo, L. Xu, J. Liu, D. Ye, and S. Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. arXiv preprint arXiv:2505.23564, 2025.
[20] D. Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv:2109.06780, 2021.
[21] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[22] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020.
[23] Z. Hou, Z. Hu, Y. Li, R. Lu, J. Tang, and Y. Dong. Treerl: Llm reinforcement learning with on-policy tree search. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025.
[24] J. Hu, J. K. Liu, H. Xu, and W. Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262, 2025.
[25] K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, et al. On the direction of rlvr updates for llm reasoning: Identification and exploitation. arXiv preprint arXiv:2603.22117, 2026.
[26] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024.
[27] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[28] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[29] Y. Ji, Z. Ma, Y. Wang, G. Chen, X. Chu, and L. Wu. Tree search for llm agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025.
[30] J. Jin, X. Li, G. Dong, Y. Zhang, Y. Zhu, Y. Zhao, H. Qian, and Z. Dou. Hira: A hierarchical reasoning framework for decoupled planning and execution in deep search. arXiv preprint arXiv:2507.02652, 2025.
[31] J. Jin, Y. Zhu, Z. Dou, G. Dong, X. Yang, C. Zhang, T. Zhao, Z. Yang, and J.-R. Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. In Companion Proceedings of the ACM on Web Conference 2025, pages 737–740, 2025.
[32] T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925, 2023.
[33] X. Lai, Z. Tian, Y. Chen, S. Yang, X. Peng, and J. Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.
[34] H. Lee, S. Phatale, H. Mansoor, K. R. Lu, T. Mesnard, J. Ferret, C. Bishop, E. Hall, V. Carbune, and A. Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023.
[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.
[36] P. Li, Z. Gao, B. Zhang, Y. Mi, X. Ma, C. Shi, T. Yuan, Y. Wu, Y. Jia, S.-C. Zhu, et al. Iterative tool usage exploration for multimodal agents via step-wise preference tuning. arXiv preprint arXiv:2504.21561, 2025.
[37] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025.
[38] X. Li, W. Jiao, J. Jin, G. Dong, J. Jin, Y. Wang, H. Wang, Y. Zhu, J.-R. Wen, Y. Lu, et al. Deepagent: A general reasoning agent with scalable toolsets. arXiv preprint arXiv:2510.21618, 2025.
[39] X. Li, J. Jin, G. Dong, H. Qian, Y. Wu, J.-R. Wen, Y. Zhu, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025.
[40] X. Li, J. Jin, Y. Zhou, Y. Wu, Z. Li, Y. Qi, and Z. Dou. Retrollm: Empowering large language models to retrieve fine-grained evidence within generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16754–16779, 2025.
[41] X. Li, H. Zou, and P. Liu. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025.
[42] Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025.
[43] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.
[44] Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
[45] K. Lu, C. Chen, X. Wang, B. Cui, Y. Liu, and W. Zhang. Pilotrl: Training language model agents via global planning-guided progressive reinforcement learning. arXiv preprint arXiv:2508.00344, 2025.
[46] Z. Lu, Z. Yao, J. Wu, C. Han, Q. Gu, X. Cai, W. Lu, J. Xiao, Y. Zhuang, and Y. Shen. Skill0: In-context agentic reinforcement learning for skill internalization. arXiv preprint arXiv:2604.02268, 2026.
[47] C. Ma, S. Yang, K. Huang, J. Lu, H. Meng, S. Wang, B. Ding, S. Vosoughi, G. Wang, and J. Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization. arXiv preprint arXiv:2603.19835, 2026.
[48] Z. Ma, S. Yang, Y. Ji, X. Wang, Y. Wang, Y. Hu, T. Huang, and X. Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026.
[49] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[50] H. Meng, K. Huang, S. Wei, C. Ma, S. Yang, X. Wang, G. Wang, B. Ding, and J. Zhou. Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms. arXiv preprint arXiv:2603.22446, 2026.
[51] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.
[52] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025.
[53] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025.
[54] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023.
[55] C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
[56] C. Qian, E. C. Acikgoz, H. Wang, X. Chen, A. Sil, D. Hakkani-Tur, G. Tur, and H. Ji. Smart: Self-aware agent for tool overuse mitigation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621, 2025.
[57] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[58] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.
[59] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020.
[60] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[61] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[62] L. Shen, Y. Zhang, C. K. Ling, X. Zhao, and T.-S. Chua. Carl: Critical action focused reinforcement learning for multi-step agent. arXiv preprint arXiv:2512.04949, 2025.
[63] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
[64] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.
[65] M. Shridhar, X. Yuan, M.-A. Côté, Y. Bisk, A. Trischler, and M. Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
[66] C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
[67] H. Su, S. Diao, X. Lu, M. Liu, J. Xu, X. Dong, Y. Fu, P. Belcak, H. Ye, H. Yin, et al. Toolorchestra: Elevating intelligence via efficient model and tool orchestration. arXiv preprint arXiv:2511.21689, 2025.
[68] Z. Su, L. Pan, X. Bai, D. Liu, G. Dong, J. Huang, M. Lv, W. Hu, F. Zhang, K. Gai, et al. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization. arXiv preprint arXiv:2508.07629, 2025.
[69] K. Team, Y. Bai, Y. Bao, Y. Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
[70] K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
[71] M. L. Team, B. Li, B. Lei, B. Wang, B. Rong, C. Wang, C. Zhang, C. Gao, C. Zhang, C. Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025.
[72] Q. Team. Qwq: Reflect deeply on the boundaries of the unknown, 2024.
[73] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
[74] H. Wang, C. Qian, M. Li, J. Qiu, B. Xue, M. Wang, H. Ji, and K.-F. Wong. Toward a theory of agents as tool-use decision-makers. arXiv preprint arXiv:2506.00886, 2025.
[75] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023.
[76] P. Wang, S. Xu, J. Li, Y. Luo, D. Li, J. Hao, and M. Zhang. Re2: Unlocking llm reasoning via reinforcement learning with re-solving. arXiv preprint arXiv:2603.07197, 2026.
[77] S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025.
[78] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[79] S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838, 2024.
[80] D. Wu, D. S. Sachan, W.-t. Yih, and M. Chen. Procedural knowledge at scale improves reasoning. arXiv preprint arXiv:2604.01348, 2026.

Showing first 80 references.


arXiv 2023

[16] [16]

J. Gao, W. Fu, M. Xie, S. Xu, C. He, Z. Mei, B. Zhu, and Y . Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

arXiv 2025

[17] [17]

Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving.arXiv preprint arXiv:2309.17452, 2023

Pith/arXiv arXiv 2023

[18] [18]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[19] [19]

Y . Guo, L. Xu, J. Liu, D. Ye, and S. Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models.arXiv preprint arXiv:2505.23564, 2025

arXiv 2025

[20] [20]

D. Hafner. Benchmarking the spectrum of agent capabilities.arXiv preprint arXiv:2109.06780, 2021

arXiv 2021

[21] [21]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

Pith/arXiv arXiv 2021

[22] [22]

Ho, A.-K

X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020. 10

2020

[23] [23]

Z. Hou, Z. Hu, Y . Li, R. Lu, J. Tang, and Y . Dong. Treerl: Llm reinforcement learning with on-policy tree search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025

2025

[24] [24]

J. Hu, J. K. Liu, H. Xu, and W. Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262, 2025

Pith/arXiv arXiv 2025

[25] [25]

Huang, H

K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, et al. On the direc- tion of rlvr updates for llm reasoning: Identification and exploitation.arXiv preprint arXiv:2603.22117, 2026

arXiv 2026

[26] [26]

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

Pith/arXiv arXiv 2024

[27] [27]

Hurst, A

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[28] [28]

Jaech, A

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

Pith/arXiv arXiv 2024

[29] [29]

Y . Ji, Z. Ma, Y . Wang, G. Chen, X. Chu, and L. Wu. Tree search for llm agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025

arXiv 2025

[30] [30]

J. Jin, X. Li, G. Dong, Y . Zhang, Y . Zhu, Y . Zhao, H. Qian, and Z. Dou. Hira: A hierarchical reasoning framework for decoupled planning and execution in deep search.arXiv preprint arXiv:2507.02652, 2025

arXiv 2025

[31] [31]

J. Jin, Y . Zhu, Z. Dou, G. Dong, X. Yang, C. Zhang, T. Zhao, Z. Yang, and J.-R. Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. InCompanion Proceedings of the ACM on Web Conference 2025, pages 737–740, 2025

2025

[32] [32]

Kaufmann, P

T. Kaufmann, P. Weng, V . Bengs, and E. Hüllermeier. A survey of reinforcement learning from human feedback.arXiv preprint arXiv:2312.14925, 2023

arXiv 2023

[33] [33]

X. Lai, Z. Tian, Y . Chen, S. Yang, X. Peng, and J. Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms.arXiv preprint arXiv:2406.18629, 2024

Pith/arXiv arXiv 2024

[34] [34]

H. Lee, S. Phatale, H. Mansoor, K. R. Lu, T. Mesnard, J. Ferret, C. Bishop, E. Hall, V . Carbune, and A. Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023

2023

[35] [35]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020

[36] [36]

P. Li, Z. Gao, B. Zhang, Y . Mi, X. Ma, C. Shi, T. Yuan, Y . Wu, Y . Jia, S.-C. Zhu, et al. Iterative tool usage exploration for multimodal agents via step-wise preference tuning.arXiv preprint arXiv:2504.21561, 2025

Pith/arXiv arXiv 2025

[37] [37]

X. Li, G. Dong, J. Jin, Y . Zhang, Y . Zhou, Y . Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search- enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

2025

[38] [38]

X. Li, W. Jiao, J. Jin, G. Dong, J. Jin, Y . Wang, H. Wang, Y . Zhu, J.-R. Wen, Y . Lu, et al. Deepagent: A general reasoning agent with scalable toolsets.arXiv preprint arXiv:2510.21618, 2025

arXiv 2025

[39] [39]

X. Li, J. Jin, G. Dong, H. Qian, Y . Wu, J.-R. Wen, Y . Zhu, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025

Pith/arXiv arXiv 2025

[40] [40]

X. Li, J. Jin, Y . Zhou, Y . Wu, Z. Li, Y . Qi, and Z. Dou. Retrollm: Empowering large language models to retrieve fine-grained evidence within generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16754–16779, 2025

2025

[41] [41]

X. Li, H. Zou, and P. Liu. Torl: Scaling tool-integrated rl.arXiv preprint arXiv:2503.23383, 2025

arXiv 2025

[42] [42]

Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y . Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

Pith/arXiv arXiv 2025

[43] [43]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023. 11

2023

[44] [44]

Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

Pith/arXiv arXiv 2025

[45] [45]

K. Lu, C. Chen, X. Wang, B. Cui, Y . Liu, and W. Zhang. Pilotrl: Training language model agents via global planning-guided progressive reinforcement learning.arXiv preprint arXiv:2508.00344, 2025

arXiv 2025

[46] [46]

Z. Lu, Z. Yao, J. Wu, C. Han, Q. Gu, X. Cai, W. Lu, J. Xiao, Y . Zhuang, and Y . Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026

Pith/arXiv arXiv 2026

[47] [47]

C. Ma, S. Yang, K. Huang, J. Lu, H. Meng, S. Wang, B. Ding, S. V osoughi, G. Wang, and J. Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization.arXiv preprint arXiv:2603.19835, 2026

arXiv 2026

[48] [48]

Z. Ma, S. Yang, Y . Ji, X. Wang, Y . Wang, Y . Hu, T. Huang, and X. Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

Pith/arXiv arXiv 2026

[49] [49]

McInnes, J

L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

Pith/arXiv arXiv 2018

[50] [50]

H. Meng, K. Huang, S. Wei, C. Ma, S. Yang, X. Wang, G. Wang, B. Ding, and J. Zhou. Sparse but critical: A token-level analysis of distributional shifts in rlvr fine-tuning of llms.arXiv preprint arXiv:2603.22446, 2026

arXiv 2026

[51] [51]

Mialon, C

G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023

[52] [52]

Muennighoff, Z

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

2025

[53] [53]

L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

Pith/arXiv arXiv 2025

[54] [54]

Press, M

O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

2023

[55] [55]

C. Qian, E. C. Acikgoz, Q. He, H. Wang, X. Chen, D. Hakkani-Tür, G. Tur, and H. Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

Pith/arXiv arXiv 2025

[56] [56]

C. Qian, E. C. Acikgoz, H. Wang, X. Chen, A. Sil, D. Hakkani-Tur, G. Tur, and H. Ji. Smart: Self-aware agent for tool overuse mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621, 2025

2025

[57] [57]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

2019

[58] [58]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference opti- mization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[59] [59]

Rasley, S

J. Rasley, S. Rajbhandari, O. Ruwase, and Y . He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

2020

[60] [60]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[61] [61]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[62] [62]

L. Shen, Y . Zhang, C. K. Ling, X. Zhao, and T.-S. Chua. Carl: Critical action focused reinforcement learning for multi-step agent.arXiv preprint arXiv:2512.04949, 2025

Pith/arXiv arXiv 2025

[63] [63]

Sheng, C

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y . Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 12

2025

[64] [64]

Shinn, F

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023

[65] [65]

Shridhar, X

M. Shridhar, X. Yuan, M.-A. Côté, Y . Bisk, A. Trischler, and M. Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

Pith/arXiv arXiv 2010

[66] [66]

Snell, J

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

Pith/arXiv arXiv 2024

[67] [67]

H. Su, S. Diao, X. Lu, M. Liu, J. Xu, X. Dong, Y . Fu, P. Belcak, H. Ye, H. Yin, et al. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv preprint arXiv:2511.21689, 2025

arXiv 2025

[68] [68]

Z. Su, L. Pan, X. Bai, D. Liu, G. Dong, J. Huang, M. Lv, W. Hu, F. Zhang, K. Gai, et al. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization.arXiv preprint arXiv:2508.07629, 2025

arXiv 2025

[69] [69]

K. Team, Y . Bai, Y . Bao, Y . Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

Pith/arXiv arXiv 2025

[70] [70]

K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

Pith/arXiv arXiv 2025

[71] [71]

M. L. Team, B. Li, B. Lei, B. Wang, B. Rong, C. Wang, C. Zhang, C. Gao, C. Zhang, C. Sun, et al. Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

arXiv 2025

[72] [72]

Q. Team. Qwq: Reflect deeply on the boundaries of the unknown, 2024

2024

[73] [73]

Trivedi, N

H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

2022

[74] [74]

H. Wang, C. Qian, M. Li, J. Qiu, B. Xue, M. Wang, H. Ji, and K.-F. Wong. Toward a theory of agents as tool-use decision-makers.arXiv preprint arXiv:2506.00886, 2025

Pith/arXiv arXiv 2025

[75] [75]

L. Wang, W. Xu, Y . Lan, Z. Hu, Y . Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023

2023

[76] [76]

P. Wang, S. Xu, J. Li, Y . Luo, D. Li, J. Hao, and M. Zhang.Re 2: Unlocking llm reasoning via reinforcement learning with re-solving.arXiv preprint arXiv:2603.07197, 2026

arXiv 2026

[77] [77]

S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

Pith/arXiv arXiv 2025

[78] [78]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022

[79] [79]

Welleck, A

S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models.arXiv preprint arXiv:2406.16838, 2024

arXiv 2024

[80] [80]

D. Wu, D. S. Sachan, W.-t. Yih, and M. Chen. Procedural knowledge at scale improves reasoning.arXiv preprint arXiv:2604.01348, 2026

Pith/arXiv arXiv 2026