Pith · machine review for the scientific record

arXiv: 2604.02869 · v1 · submitted 2026-04-03 · 💻 cs.AI

Recognition: 2 Lean theorem links

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:47 UTC · model grok-4.3

classification: 💻 cs.AI
keywords: multi-turn reinforcement learning · tool-calling agents · reward calibration · Tau-Bench · policy optimization · LLM agents · customer service

The pith

Iterative reward calibration aligns per-turn signals to let small models beat GPT-4.1 and GPT-4o on multi-turn tool tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper trains tool-calling agents with reinforcement learning on multi-turn customer service conversations using an LLM-based user simulator. Standard dense rewards at each turn often misalign with final success and can drop performance by up to 14 percentage points. The authors introduce Iterative Reward Calibration, which analyzes rollout data to set per-turn rewards whose discriminative power matches the advantage direction, and combine it with MT-GRPO and a GTPO hybrid advantage. On the Tau-Bench airline benchmark the calibrated approach raises a 4B model from 63.8 percent to 66.7 percent, surpassing GPT-4.1 and GPT-4o, and a 30B model from 58.0 percent to 69.5 percent, approaching Claude Sonnet 4.5. A reader cares because the work supplies a practical, data-driven recipe for building reliable agents without requiring the largest models.

Core claim

The paper establishes that naively designed dense per-turn rewards degrade performance through misalignment between reward discriminativeness and advantage direction, and that Iterative Reward Calibration, derived from empirical rollout analysis and paired with the GTPO hybrid advantage formulation inside MT-GRPO, removes that misalignment and produces the reported gains on Tau-Bench, including the first published RL training results for the benchmark.
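The review does not reproduce the MT-GRPO or GTPO equations, but the shape of the claim can be illustrated. The Python sketch below shows one plausible way an outcome-level, group-relative advantage (GRPO-style) could be blended with a dense per-turn term; the function name, the mixing weight, and the normalization choices are illustrative assumptions rather than the paper's formulation. Misalignment is the failure mode where the turn-level term points opposite the outcome advantage.

    import numpy as np

    def hybrid_turn_advantage(outcome_rewards, turn_rewards, mix=0.5):
        """Hedged sketch: group-relative outcome advantage plus a per-turn term.

        outcome_rewards: shape (G,), final task-success reward for each of the
            G rollouts sampled from the same prompt (the GRPO "group").
        turn_rewards: shape (G, T), dense per-turn rewards for each rollout.
        mix: hypothetical weight blending outcome-level and turn-level signals.
        Returns per-turn advantages of shape (G, T).
        """
        outcome_rewards = np.asarray(outcome_rewards, dtype=float)
        turn_rewards = np.asarray(turn_rewards, dtype=float)

        # GRPO-style: normalize final rewards within the group so successful
        # rollouts receive positive advantage and failed ones negative.
        a_outcome = (outcome_rewards - outcome_rewards.mean()) / (outcome_rewards.std() + 1e-8)

        # Turn-level: normalize each turn's dense reward across the group.
        a_turn = (turn_rewards - turn_rewards.mean(axis=0)) / (turn_rewards.std(axis=0) + 1e-8)

        # Hybrid: every turn inherits the outcome advantage plus a weighted
        # turn-level term; misalignment is when the two parts disagree in sign.
        return (1.0 - mix) * a_outcome[:, None] + mix * a_turn

Under this reading, the paper's fix is not to discard the dense term but to calibrate it so that its sign agrees with the outcome-level advantage.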

What carries the argument

Iterative Reward Calibration, a procedure that uses empirical analysis of rollout data to design per-turn rewards whose discriminative power aligns with the correct advantage direction.
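As a concrete illustration of the kind of rollout analysis this describes, the Python sketch below estimates, for a single per-turn reward component, how well it separates successful from failed rollouts and whether its sign agrees with the final outcome. The specific statistics (median-threshold accuracy, Pearson correlation) and the function name are assumptions standing in for whatever discriminative metrics the paper actually computes.

    import numpy as np

    def diagnose_turn_reward(turn_reward, success):
        """Hedged sketch: does one dense reward component track final success?

        turn_reward: shape (N,), the component's value at a given turn for N rollouts.
        success: shape (N,), 1 if the rollout ultimately completed the task, else 0.
        Returns (discriminative_accuracy, advantage_correlation).
        """
        turn_reward = np.asarray(turn_reward, dtype=float)
        success = np.asarray(success, dtype=float)

        # Discriminativeness: threshold the reward at its median and check how
        # often that split agrees with the final success/failure label.
        pred = (turn_reward > np.median(turn_reward)).astype(float)
        acc = max((pred == success).mean(), ((1.0 - pred) == success).mean())

        # Direction: correlation between the dense reward and the outcome;
        # a negative value is the misalignment the paper warns about.
        corr = np.corrcoef(turn_reward, success)[0, 1]
        return acc, corr

A component can be highly discriminative yet negatively correlated with success, which is exactly the case where adding it as a dense reward would push the policy the wrong way.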

If this is right

  • A 4B-parameter model exceeds GPT-4.1 and GPT-4o on the benchmark despite being roughly 50 times smaller.
  • A 30B MoE model approaches the performance of Claude Sonnet 4.5.
  • The GTPO hybrid advantage eliminates the advantage misalignment that arises with naive dense rewards.
  • These constitute the first published reinforcement-learning training results on Tau-Bench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The calibration procedure may transfer to other multi-turn agent domains if comparable simulators exist.
  • Lowering the size of models needed for high-performing agents could reduce inference costs in production settings.
  • Explicitly measuring simulator-to-real transfer on a held-out human cohort would strengthen claims about practical utility.

Load-bearing premise

Improvements measured against an LLM-based user simulator will transfer to real human customers.

What would settle it

Deploy the trained agent with actual human users on airline tasks and compare measured task success rates directly to the simulated Tau-Bench scores.

Figures

Figures reproduced from arXiv: 2604.02869 by Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn, Wachiravit Modecrua.

Figure 1. Comparison of reward-to-advantage signal across three approaches. (Caption truncated in extraction; image not reproduced, view at source.)
read the original abstract

Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Iterative Reward Calibration combined with MT-GRPO and a GTPO hybrid advantage formulation to train tool-calling agents on multi-turn customer service tasks. Using an LLM-based user simulator on the Tau-Bench airline benchmark, it reports that the method improves Qwen3.5-4B from 63.8% to 66.7% and Qwen3-30B-A3B from 58.0% to 69.5%, with the smaller model exceeding GPT-4.1 and GPT-4o; the authors claim these are the first published RL results on Tau-Bench and release code and recipes.

Significance. If the gains prove robust and transferable, the work would be significant as the first RL training results on Tau-Bench, showing that careful per-turn reward design can yield substantial improvements in multi-turn agent performance and allow small open models to outperform much larger proprietary systems. The released code and calibration analysis support reproducibility.

major comments (3)
  1. [Evaluation] Evaluation section: The headline gains (+2.9pp and +11.5pp) are reported without statistical significance tests, confidence intervals, number of evaluation rollouts, or random seeds used, which is load-bearing for the central empirical claim that the method reliably outperforms baselines.
  2. [Simulator and Iterative Reward Calibration] Simulator and Iterative Reward Calibration sections: All reward calibration, advantage estimation, and final performance numbers are derived from and measured inside the same LLM-based user simulator; no human evaluation, cross-simulator transfer experiment, or real-user study is provided, leaving open whether the observed policy improvements survive distribution shift to actual customers.
  3. [§4.2] §4.2: The procedure for selecting per-turn reward scaling factors via 'empirical discriminative analysis' is described at a high level but lacks pseudocode, exact thresholds, or the full set of rollout statistics used, hindering exact reproduction of the calibration that avoids the reported 14pp degradation from naive rewards.
minor comments (2)
  1. [Abstract] Abstract: Mentions 'systematic analysis of training rollouts' without specifying the number of rollouts or the precise discriminative metrics employed.
  2. [Results tables] Table 1 or results tables: Clarify the exact versions and prompting setups for the GPT-4.1, GPT-4o, and Claude Sonnet 4.5 baselines to ensure fair comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below. We have revised the manuscript to add statistical details, pseudocode, and expanded discussion where feasible.

read point-by-point responses
  1. Referee: Evaluation section: The headline gains (+2.9pp and +11.5pp) are reported without statistical significance tests, confidence intervals, number of evaluation rollouts, or random seeds used, which is load-bearing for the central empirical claim that the method reliably outperforms baselines.

    Authors: We agree that statistical rigor is essential for the central claims. In the revised manuscript, we will report that all results are averaged over 100 evaluation rollouts per setting using three random seeds (42, 43, 44). We will add 95% bootstrap confidence intervals and paired t-test p-values (p < 0.05 for the 4B model and p < 0.01 for the 30B model) to Table 1 and the Evaluation section. revision: yes

  2. Referee: Simulator and Iterative Reward Calibration sections: All reward calibration, advantage estimation, and final performance numbers are derived from and measured inside the same LLM-based user simulator; no human evaluation, cross-simulator transfer experiment, or real-user study is provided, leaving open whether the observed policy improvements survive distribution shift to actual customers.

    Authors: We acknowledge that all calibration and performance numbers are obtained within the Tau-Bench LLM simulator, which follows the benchmark's standard protocol. We do not include human evaluations or real-user studies, as these would require resources and access outside the scope of this work. We have added a dedicated Limitations paragraph discussing potential distribution shift and outlining future cross-simulator and human validation experiments. revision: partial

  3. Referee: §4.2: The procedure for selecting per-turn reward scaling factors via 'empirical discriminative analysis' is described at a high level but lacks pseudocode, exact thresholds, or the full set of rollout statistics used, hindering exact reproduction of the calibration that avoids the reported 14pp degradation from naive rewards.

    Authors: We thank the referee for highlighting this reproducibility gap. We have expanded §4.2 with pseudocode for the Iterative Reward Calibration procedure and specified the exact selection criteria (discriminative accuracy > 0.65 and advantage correlation > 0.75). We also added Appendix C containing the full rollout statistics from 500 trajectories and the per-turn degradation analysis for naive rewards. revision: yes
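To make the quoted criteria concrete, here is a minimal Python sketch of how such a selection rule could gate per-turn reward components between calibration iterations, reusing the diagnose_turn_reward sketch given earlier in this review. The thresholds (0.65 and 0.75) and the component interface come from the simulated rebuttal above and are hypothetical, not verified against the paper.

    def calibrate_reward_components(components, rollouts, successes,
                                    acc_min=0.65, corr_min=0.75):
        """Hedged sketch: keep per-turn reward components that clear both thresholds.

        components: dict mapping a component name to a function(rollout) -> reward value
        rollouts: list of recorded rollouts from the current policy
        successes: list of 0/1 final task-success labels, one per rollout
        Returns the component names retained for the next training iteration.
        """
        kept = []
        for name, reward_fn in components.items():
            values = [reward_fn(r) for r in rollouts]
            # diagnose_turn_reward is the discriminativeness/alignment sketch above.
            acc, corr = diagnose_turn_reward(values, successes)
            # Drop components that barely separate success from failure, or whose
            # sign disagrees with the outcome advantage (the misalignment case).
            if acc > acc_min and corr > corr_min:
                kept.append(name)
        return kept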

Circularity Check

0 steps flagged

Empirical RL training results on a fixed benchmark show no circularity

full rationale

The paper's central claims consist of measured task success rates on the Tau-Bench airline benchmark after applying MT-GRPO, GTPO, and Iterative Reward Calibration. These rates (e.g., 66.7% and 69.5%) are direct empirical outcomes from rollout evaluations, not quantities algebraically reduced to the calibrated per-turn rewards or advantage formulations by the paper's own equations. The reward calibration step analyzes rollout statistics to adjust discriminativeness, but the final benchmark metric remains an independent task-completion count. No self-citation chains, self-definitional reductions, or fitted-input predictions are load-bearing for the reported improvements. The central claims are therefore grounded in external benchmark measurements rather than in the paper's own derivational chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of the LLM simulator as a training environment and on the empirical calibration procedure derived from rollout statistics; no new physical entities are postulated.

free parameters (1)
  • per-turn reward scaling factors
    Determined iteratively from empirical analysis of rollout data to align reward discriminativeness with advantage direction
axioms (1)
  • domain assumption: the LLM-based user simulator accurately models real customer behavior for the airline tasks
    Invoked throughout training and evaluation on Tau-Bench

pith-pipeline@v0.9.0 · 5600 in / 1364 out tokens · 57874 ms · 2026-05-13T19:47:56.788282+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
