Pith: machine review for the scientific record.

arXiv: 2511.14846 · v2 · submitted 2025-11-18 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization

classification: 💻 cs.LG · cs.AI · cs.CL
keywords: gtpo, multi-turn, reasoning, group, grpo, optimization, policy, training

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR), where models iteratively reason, generate code, and verify through execution, remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation, in which normalized discounted returns serve as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% across diverse math reasoning benchmarks, establishing its effectiveness. GTPO also improves on GRPO by 3.9% on commonsense reasoning and program synthesis tasks, demonstrating its generalizability to non-math domains. Importantly, GTPO incurs negligible overhead, ensuring its practicality for real-world scenarios.
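
The second idea in the abstract, using normalized discounted returns as turn-level advantages, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how returns are normalized or what discount is used, so the sketch assumes that per-turn rewards are already given, that all turn returns in a sampled group are pooled and standardized by their mean and standard deviation, and that `gamma=0.9` is a reasonable default. The function names are hypothetical, and the self-supervised reward shaping step is not modeled. A GRPO-style trajectory-level advantage is included only for contrast.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Return-to-go per turn: G_t = r_t + gamma * G_{t+1} (gamma is an assumed value)."""
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future rewards.
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_turn_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Turn-level advantages for a group of trajectories.

    Assumption (not stated in the abstract): returns from every turn of every
    trajectory in the group are pooled and standardized together.
    """
    all_returns = [discounted_returns(r, gamma) for r in group_turn_rewards]
    pooled = [g for trajectory in all_returns for g in trajectory]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [[(g - mu) / (sigma + eps) for g in trajectory]
            for trajectory in all_returns]


def trajectory_level_advantages(outcome_rewards, eps=1e-8):
    """GRPO-style baseline: one standardized advantage per whole trajectory."""
    mu, sigma = mean(outcome_rewards), pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


if __name__ == "__main__":
    # Sparse outcome reward on the final turn only.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]

    # Two sampled trajectories with two turns each: one succeeds, one fails.
    group = [[0.0, 1.0], [0.0, 0.0]]
    print(group_turn_advantages(group, gamma=0.5))
    print(trajectory_level_advantages([1.0, 0.0]))
```

In this sketch each turn receives its own advantage, so earlier turns of a successful trajectory are credited less than the final turn, whereas the trajectory-level baseline assigns a single value to every turn of a trajectory.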

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

  2. Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

    cs.AI 2026-04 conditional novelty 7.0

    Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.

  3. Learning CLI Agents with Structured Action Credit under Selective Observation

    cs.AI 2026-05 unverdicted novelty 5.0

    CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.