arxiv: 2511.14846 · v2 · submitted 2025-11-18 · 💻 cs.LG · cs.AI · cs.CL

Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization

Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

Pith reviewed 2026-05-17 20:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · large language models · multi-turn reasoning · tool-integrated reasoning · policy optimization · agentic reasoning · reward shaping

The pith

GTPO equips large language models with turn-by-turn reward signals to master complex multi-turn tool use and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Group Turn Policy Optimization as a reinforcement learning method tailored for training large language models on multi-turn Tool-Integrated Reasoning tasks. Existing approaches like Group Relative Policy Optimization deliver rewards only after complete trajectories, which supplies weak learning signals across many steps and stalls progress on problems that require iterative reasoning, code generation, and verification. GTPO instead breaks rewards down to individual turns, computes advantages from normalized discounted returns, and densifies sparse outcome rewards by shaping them with self-supervision drawn from the code the model itself produces. If these mechanisms succeed, models can extract usable feedback from each step of a long interaction rather than waiting for a final binary success or failure.

Core claim

GTPO is a reinforcement learning algorithm for multi-turn Tool-Integrated Reasoning that assigns rewards at the level of individual turns, uses normalized discounted returns as advantages, and applies self-supervised reward shaping based on generated code to make sparse binary rewards denser.
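The return-based advantage in the core claim can be illustrated with a minimal sketch. It assumes the discounted return of turn j sums rewards from turn j to the last turn of the trajectory, and that returns are normalized against the pooled mean and standard deviation of all turns in the sampled group. The function names, the `gamma` value, and the pooling choice are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def discounted_turn_returns(turn_rewards, gamma=0.9):
    """Compute R_j = sum over m >= j of gamma^(m - j) * r_m for one trajectory."""
    returns = np.zeros(len(turn_rewards), dtype=float)
    running = 0.0
    # Walk backwards so each return reuses the return of the following turn.
    for j in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[j] + gamma * running
        returns[j] = running
    return returns

def group_normalized_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Normalize turn returns with the pooled group mean and std (an assumption)."""
    returns = [discounted_turn_returns(r, gamma) for r in group_turn_rewards]
    pooled = np.concatenate(returns)
    mean, std = pooled.mean(), pooled.std()
    return [(r - mean) / (std + eps) for r in returns]

# Hypothetical group of two rollouts: one solved on its third turn, one failed.
group = [[0.0, 0.0, 1.0], [0.0, 0.0]]
advantages = group_normalized_advantages(group, gamma=0.5)
```

In this toy group the turns of the successful rollout receive larger advantages the closer they sit to the rewarded final turn, while both turns of the failed rollout fall below the group mean.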

What carries the argument

Group Turn Policy Optimization (GTPO), which decomposes trajectory-level policy optimization into per-turn reward assignment and advantage estimation to supply finer-grained learning signals during multi-turn interactions.
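How a per-turn advantage could feed a clipped policy objective is sketched below. The min/clip form follows the objective fragment quoted further down this page; everything else is an assumption for illustration: the token-level mean, the `clip_eps` value, the `turn_ids` mapping from tokens to turns, and the omission of any masking for tool-output tokens.

```python
import numpy as np

def clipped_turn_surrogate(logp_new, logp_old, turn_ids, turn_advantages, clip_eps=0.2):
    """Mean over tokens of min(w * A, clip(w, 1 - eps, 1 + eps) * A).

    turn_ids[t] is the index of the turn that token t belongs to, so every
    token inherits the advantage of its own turn.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    # Broadcast each turn-level advantage to the tokens of that turn.
    adv = np.asarray(turn_advantages, dtype=float)[np.asarray(turn_ids)]
    ratio = np.exp(logp_new - logp_old)  # importance weight w per token
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))

# Hypothetical rollout: two tokens in turn 0, one token in turn 1.
value = clipped_turn_surrogate(
    logp_new=[-1.0, -1.0, -2.0],
    logp_old=[-1.0, -1.0, -2.0],
    turn_ids=[0, 0, 1],
    turn_advantages=[1.0, -1.0],
)
```

The only difference from a trajectory-level setup in this sketch is the indexing step: tokens in different turns of the same rollout can carry different advantages.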

If this is right

GTPO outperforms GRPO by 3.0 percent on diverse math reasoning benchmarks.
GTPO improves on GRPO by 3.9 percent on commonsense reasoning and program synthesis tasks.
GTPO adds negligible computational overhead during training.
The approach generalizes from math domains to other multi-turn reasoning settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Turn-level reward designs may transfer to agent training in longer-horizon settings such as web navigation or sequential decision tasks.
Self-supervised shaping from verifiable outputs like code could extend to other structured generations such as proofs or data transformations.
Testing the same per-turn machinery inside larger models or different base optimizers would show whether the gains remain consistent as scale increases.

Load-bearing premise

That the reported performance gains arise from the three proposed changes to reward structure and advantage calculation rather than from unstated differences in hyperparameters, training protocols, or evaluation procedures.

What would settle it

A controlled comparison that trains both GTPO and GRPO under identical hyperparameters, random seeds, and benchmark scripts and then measures whether the accuracy gap closes or reverses.

Figures

Figures reproduced from arXiv: 2511.14846 by Anoop Deoras, Hung Le, Kangrui Ruan, Songyang Han, Varun Kumar, Yifeng Ding, Zhenghui Jin, Zijian Wang.

**Figure 1.** Tool-integrated reasoning (TIR): Given a problem, the model progresses over multiple turns, where each turn consists of: (1) generating textual reasoning, (2) invoking tools (e.g., code), and (3) incorporating tool execution results to refine its understanding. The model repeats this cycle until a termination condition is met, either by producing a final answer or by reaching a predefined stopping criterio…

**Figure 2.** An overview of GTPO: Unlike existing approaches that rely on trajectory-level rewards, GTPO introduces a turn-level reward function that assigns diverse, rule-based rewards for individual turns within each trajectory and performs turn-level return-based discounting for advantage calculation.

**Figure 3.** GTPO reward shaping strategy: In GTPO, each rollout trajectory is partitioned by final outcome (correct vs. incorrect), and the code content is extracted. For each trajectory in the incorrect group, we compute its average similarity against all samples in the correct group and use the similarity score as its partial reward, so that wrong trajectories can still be properly utilized during training for more …

**Figure 5.** Training accuracy curves of GRPO and GTPO under the same experimental setup and training datasets.

**Figure 4.** Qualitative example: We demonstrated an AIME24 example task to compare the distinct coding patterns of GRPO and GTPO. Qwen2.5-7B-Instruct trained with GTPO can write correct code along with accurate tests that thoroughly validate the code correctness, while Qwen2.5-7B-Instruct trained with GRPO fails to solve the problem.

**Figure 7.** An AMC23 example to compare the distinction in generation samples between GRPO and GTPO.

**Figure 8.** Format correctness curves of GRPO and GTPO. GTPO exhibits superior performance throughout the training process, achieving a robust improvement to around 99% by training step 40. In contrast, GRPO shows more volatile behavior, particularly evident in the dramatic spike and subsequent drop around training steps 20-25. While GRPO eventually recovers and stabilizes around 97% by the end of t…
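The reward shaping strategy described in the Figure 3 caption can be sketched as follows. The caption does not say which similarity measure is used, so this example substitutes Python's `difflib.SequenceMatcher` ratio as a stand-in. The function names and the choice to give zero reward when no rollout in the group is correct are likewise assumptions.

```python
from difflib import SequenceMatcher

def code_similarity(a, b):
    """Stand-in similarity in [0, 1]; the paper's actual metric is not given here."""
    return SequenceMatcher(None, a, b).ratio()

def shaped_outcome_rewards(codes, is_correct):
    """Give correct rollouts 1.0 and incorrect ones their mean similarity to correct code."""
    correct_codes = [c for c, ok in zip(codes, is_correct) if ok]
    rewards = []
    for code, ok in zip(codes, is_correct):
        if ok:
            rewards.append(1.0)
        elif not correct_codes:
            # No correct sample in the group to compare against (assumed fallback).
            rewards.append(0.0)
        else:
            total = sum(code_similarity(code, c) for c in correct_codes)
            rewards.append(total / len(correct_codes))
    return rewards

# Hypothetical group: one correct rollout and two incorrect ones.
codes = ["print(sum(range(10)))", "print(sum(range(9)))", "x = 0"]
rewards = shaped_outcome_rewards(codes, [True, False, False])
```

In this toy group the near-miss rollout earns a larger partial reward than the unrelated one, which is the densification effect the caption describes.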

Original abstract

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% across diverse math reasoning benchmarks, establishing its effectiveness. GTPO also improves GRPO by 3.9% on commonsense reasoning and program synthesis tasks, demonstrating its generalizability to non-math domains. Importantly, GTPO incurs negligible overhead, ensuring its practicality for real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GTPO refines GRPO with turn-level rewards, return-based advantages, and code self-supervision, and reports a 3.0% gain on math reasoning benchmarks and a 3.9% gain on commonsense reasoning and program synthesis tasks. The paper spells out the three changes with explicit equations and reports negligible overhead and matched training setups between the GTPO and GRPO runs. That setup makes the comparison look fair and ties the improvement to the proposed pieces rather than to hidden differences in tuning or data.

Referee Report

0 major / 3 minor

Summary. The manuscript presents Group Turn Policy Optimization (GTPO), a reinforcement learning algorithm designed for multi-turn Tool-Integrated Reasoning (TIR) with large language models. It builds upon Group Relative Policy Optimization (GRPO) by introducing turn-level reward assignment for fine-grained feedback, return-based advantage estimation with normalized discounted returns, and self-supervised reward shaping using signals from generated code to mitigate sparse rewards. The evaluation demonstrates that GTPO achieves a 3.0% improvement over GRPO on diverse math reasoning benchmarks and a 3.9% improvement on commonsense reasoning and program synthesis tasks, with negligible additional computational cost.

Significance. Should the empirical gains hold under rigorous controls, GTPO represents a meaningful step forward in addressing the challenges of coarse-grained rewards in long-horizon agentic tasks. The method's ability to provide denser signals through turn-level processing and self-supervision could improve training efficiency for tool-using agents. The reported generalizability beyond math domains and low overhead are notable strengths that suggest practical applicability. Explicit formulations of the components aid reproducibility.

minor comments (3)

The abstract would be strengthened by briefly mentioning the specific benchmarks used and the base model to provide immediate context for the performance claims.
Consider adding a figure or table summarizing the ablation studies if performed to highlight the contribution of each innovation.
Ensure consistent use of terminology, such as 'turn-level' versus 'trajectory-level', throughout the manuscript for clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We are pleased that the referee recognizes the potential of GTPO to provide denser learning signals for multi-turn tool-integrated reasoning and notes its generalizability and low overhead.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes GTPO as a new RL algorithm whose three innovations (turn-level reward assignment, return-based advantage estimation, and self-supervised reward shaping) are explicitly formulated in the method sections as independent extensions to GRPO. These are not derived from or equivalent to any fitted parameters, self-referential definitions, or prior results within this work; instead, they are presented as novel design choices whose value is assessed via direct empirical comparisons on math, commonsense, and program synthesis benchmarks. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the central performance claims to inputs by construction. The evaluation results stand as external evidence rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract, so specific parameters and assumptions cannot be fully audited. The method implicitly relies on standard RL modeling of sequential reasoning as a decision process and introduces new algorithmic components without external validation details.

axioms (1)

Domain assumption: Sequential tool-integrated reasoning can be modeled as a Markov decision process suitable for policy optimization. This is a standard assumption underlying both GRPO and the proposed GTPO for applying RL to multi-turn tasks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

GTPO introduces three key innovations: (1) turn-level reward assignment ... (2) return-based advantage estimation ... (3) self-supervised reward shaping that exploits self-supervision signals from generated code

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono · unclear

$J_{\mathrm{GTPO}}(\theta) = \mathbb{E} \ldots \min(w_{i,j,t}\,\hat{A}_{i,j},\ \operatorname{clip}\ldots)$ with $R_{i,j} = \sum_{m} \gamma^{m-j}\, r_{i,m}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
cs.CL · 2026-05 · unverdicted · novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
cs.AI · 2026-04 · conditional · novelty 7.0

Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.

Learning CLI Agents with Structured Action Credit under Selective Observation
cs.AI · 2026-05 · unverdicted · novelty 5.0

CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1]

OpenAI o1 System Card

ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A challenging benchmark...

[2]

In 2nd AI for Math Workshop @ ICML 2025

Understanding r1-zero-like training: A critical perspective. In 2nd AI for Math Workshop @ ICML 2025. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural info...

[3]

Are nlp models really able to solve simple math word problems? Preprint, arXiv:2103.07191. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. P...

[4]

Group Sequence Policy Optimization

Group sequence policy optimization. Preprint, arXiv:2507.18071.

A Appendix

A.1 Turn-level Format Reward Design

In practice, considering the nature of TIR tasks, we focus on two major format requirements: (1) the format of tool calling must be correct, and (2) there must exist at least one tool call throughout the trajectory. Specifically, we assign rformat...
