Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.
arXiv preprint arXiv:2512.19126
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.