Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
Pith reviewed 2026-05-16 12:22 UTC · model grok-4.3
The pith
Fission-GRPO converts execution errors into on-policy recovery training for LLMs by splitting failed trajectories and adding simulator feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fission-GRPO converts execution errors into on-policy corrective supervision within the RL training loop. The core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a fine-tuned Error Simulator, then resampling multiple recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases.
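The mechanism is described only in prose; a minimal sketch of what one fission step might look like, assuming hypothetical interfaces — `rollout`, `diagnose`, `Trajectory`, and the group size `k_recoveries` are illustrative names, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    prompt: str
    actions: List[str]   # tool calls and environment responses so far
    succeeded: bool
    reward: float

def fission_grpo_step(
    rollout: Callable[[str], Trajectory],       # sample from the current policy
    diagnose: Callable[[Trajectory], str],      # fine-tuned Error Simulator
    update: Callable[[List[List[Trajectory]]], None],  # GRPO update on rollout groups
    tasks: List[str],
    k_recoveries: int = 4,
) -> None:
    groups: List[List[Trajectory]] = []
    for task in tasks:
        traj = rollout(task)
        if traj.succeeded:
            groups.append([rollout(task) for _ in range(k_recoveries)])
            continue
        # Fission: turn the failure into a fresh training instance by
        # appending the simulator's diagnostic feedback to the failed
        # context, then resample recovery rollouts from the current policy.
        feedback = diagnose(traj)
        recovery_prompt = "\n".join([traj.prompt, *traj.actions, feedback])
        groups.append([rollout(recovery_prompt) for _ in range(k_recoveries)])
    # GRPO normalizes rewards within each group, so recovery attempts on
    # the same augmented failure compete against one another.
    update(groups)
```

The key property is that both the original failure and the recovery attempts are drawn from the policy being updated; only the diagnostic text comes from the simulator.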
What carries the argument
The fission mechanism that splits failed trajectories and augments them with diagnostic feedback from a fine-tuned Error Simulator to create on-policy recovery examples.
If this is right
- Error recovery rate on BFCL v4 Multi-Turn increases by 5.7 percentage points for Qwen3-8B (one plausible formalization of this metric is sketched after this list).
- Overall task accuracy rises from 42.75% to 46.75% on the same benchmark.
- Performance gains extend to TAU-Bench and TAU2-Bench with improvements up to 17.4% in some settings.
- The method surpasses both standard RL baselines and existing specialized tool-use agents.
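The first bullet presupposes a per-error recovery metric that the section never defines. One plausible formalization — hypothetical, since BFCL v4's exact definition may differ — counts an execution error as recovered when a later tool call in the same episode succeeds:

```python
# Hypothetical recovery-rate metric; BFCL v4's exact definition may differ.
# An execution error counts as "recovered" if a subsequent tool call in the
# same episode succeeds before the episode ends.

def error_recovery_rate(episodes):
    """episodes: list of lists of tool-call outcomes, True = success."""
    errors, recovered = 0, 0
    for calls in episodes:
        for i, ok in enumerate(calls):
            if not ok:
                errors += 1
                if any(calls[i + 1:]):   # some later call succeeded
                    recovered += 1
    return recovered / errors if errors else 1.0

# Example: a failed call followed by a successful retry counts as recovered.
print(error_recovery_rate([[True, False, True], [False, False]]))  # ~0.333
```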
Where Pith is reading between the lines
- The technique may reduce reliance on pre-collected error datasets by generating corrective examples dynamically during training.
- Similar fission-based supervision could improve recovery learning in other sequential decision domains such as code execution or robotic control.
- The error simulator may require periodic retraining as the policy improves to keep feedback aligned with new failure modes.
Load-bearing premise
The fine-tuned Error Simulator produces diagnostic feedback that accurately matches the evolving failure modes of the policy being trained.
What would settle it
Training the same model on BFCL v4 Multi-Turn with standard RL and with Fission-GRPO, then observing no error-recovery-rate gain for the latter, would refute the value of the fission step.
Original abstract
Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: after a tool-call error, smaller models often fall into repetitive invalid re-invocations instead of interpreting the feedback and recovering. This failure mode persists because current training paradigms do not explicitly teach models how to recover from execution errors. In particular, standard reinforcement learning (RL) collapses rich failure experience into sparse negative rewards, while pre-collected error-correction datasets become mismatched to the policy's evolving failure modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into on-policy corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a fine-tuned Error Simulator, then resampling multiple recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and overall accuracy by 4.0% (from 42.75% to 46.75%), outperforming both RL baselines and specialized tool-use agents. The method further generalizes to TAU-Bench and TAU2-Bench, achieving leading results across most settings with gains up to +17.4%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Fission-GRPO, an RL framework for robust multi-turn tool use in LLMs. Failed trajectories are fissioned by augmenting them with diagnostic feedback from a separately fine-tuned Error Simulator; multiple recovery rollouts are then resampled on-policy to convert execution errors into corrective supervision. On BFCL v4 Multi-Turn, the method raises Qwen3-8B's error recovery rate by 5.7% absolute and overall accuracy by 4.0% (from 42.75% to 46.75%), outperforming standard RL baselines and specialized tool-use agents; gains of up to +17.4% are also reported on TAU-Bench and TAU2-Bench.
Significance. If the Error Simulator remains aligned with the policy's shifting failure distribution, Fission-GRPO offers a concrete way to extract dense, on-policy recovery signals from sparse execution failures, addressing a persistent brittleness in agentic tool calling that standard RL and static datasets do not. The reported absolute gains on a challenging multi-turn benchmark and generalization across suites suggest the approach could improve reliability of smaller open models in real tool-use loops, provided the on-policy property is verified.
major comments (2)
- [Method] Method section (core fission step): The claim that fissioned trajectories supply on-policy corrective supervision requires the fine-tuned Error Simulator to produce diagnostics that track the policy's current, evolving error distribution at each RL stage. The manuscript provides no description of simulator update frequency, whether its training data is refreshed from the latest policy rollouts, or any quantitative check (e.g., agreement rate between simulator feedback and observed policy failures) that alignment is maintained. A static or lagged simulator would render the added feedback off-policy or noisy, undermining the central distinction from pre-collected error-correction datasets.
- [Experiments] Experiments (BFCL v4 results): The 5.7% recovery-rate and 4.0% accuracy gains are presented without ablations that isolate the contribution of the fission mechanism versus the Error Simulator itself, reward shaping, or additional data volume. No error analysis of remaining failure modes or sensitivity tests under varied simulator quality is reported, making it difficult to confirm that improvements stem specifically from on-policy recovery supervision rather than auxiliary factors.
minor comments (2)
- [Abstract] Abstract: The generalization statement 'gains up to +17.4%' on TAU-Bench/TAU2-Bench does not specify the exact metric (accuracy, recovery rate, or success rate) or the precise baselines being compared.
- [Method] Notation: The term 'fission' is used for the trajectory-augmentation step but is not formally defined with pseudocode or a precise algorithmic description, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the presentation of Fission-GRPO. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Method] Method section (core fission step): The claim that fissioned trajectories supply on-policy corrective supervision requires the fine-tuned Error Simulator to produce diagnostics that track the policy's current, evolving error distribution at each RL stage. The manuscript provides no description of simulator update frequency, whether its training data is refreshed from the latest policy rollouts, or any quantitative check (e.g., agreement rate between simulator feedback and observed policy failures) that alignment is maintained. A static or lagged simulator would render the added feedback off-policy or noisy, undermining the central distinction from pre-collected error-correction datasets.
Authors: We appreciate the referee's emphasis on verifying the on-policy property. The on-policy character of Fission-GRPO arises from resampling multiple recovery rollouts directly from the current policy at each RL step; the Error Simulator supplies only the diagnostic label for the original failure, which is then used to construct the corrective training instance. The simulator is trained once on error trajectories collected from the base policy prior to RL and remains fixed thereafter for computational efficiency. In the revised manuscript we will (i) explicitly describe this training procedure and data source in Section 3, (ii) add a quantitative alignment check reporting agreement rates between simulator diagnostics and observed policy failures at multiple RL checkpoints, and (iii) clarify that the corrective supervision remains on-policy because the recovery actions themselves are generated by the evolving policy rather than by any static dataset. These additions will be placed in the Method section and supported by a new table in the appendix. revision: partial
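A sketch of what the alignment check promised in (ii) could look like, assuming hypothetical `diagnose` and `classify` interfaces and that error categories are comparable strings:

```python
# One plausible form of the promised alignment check (all names hypothetical):
# at each RL checkpoint, compare the simulator's predicted error category
# against the category observed in fresh on-policy failures.

from collections import Counter

def simulator_agreement(failures, diagnose, classify):
    """
    failures: failed trajectories sampled from the *current* policy
    diagnose: Error Simulator, trajectory -> predicted error category
    classify: ground-truth labeler, trajectory -> observed error category
    Returns the overall agreement rate plus a per-category breakdown.
    """
    per_category = Counter()
    hits = Counter()
    for traj in failures:
        observed = classify(traj)
        per_category[observed] += 1
        if diagnose(traj) == observed:
            hits[observed] += 1
    overall = sum(hits.values()) / max(1, sum(per_category.values()))
    breakdown = {c: hits[c] / n for c, n in per_category.items()}
    return overall, breakdown
```

A falling agreement rate across checkpoints would be direct evidence that the frozen simulator is drifting off the policy's current failure distribution.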
Referee: [Experiments] Experiments (BFCL v4 results): The 5.7% recovery-rate and 4.0% accuracy gains are presented without ablations that isolate the contribution of the fission mechanism versus the Error Simulator itself, reward shaping, or additional data volume. No error analysis of remaining failure modes or sensitivity tests under varied simulator quality is reported, making it difficult to confirm that improvements stem specifically from on-policy recovery supervision rather than auxiliary factors.
Authors: We agree that isolating the fission mechanism and providing sensitivity analyses will make the source of gains clearer. In the revised Experiments section we will add: (a) an ablation removing the fission step while retaining the Error Simulator, (b) a comparison that varies simulator quality by training simulators on subsets of error data, (c) a breakdown of reward-shaping versus data-volume effects, and (d) a qualitative error analysis of remaining failure modes together with sensitivity plots under degraded simulator conditions. These new results will be reported on BFCL v4 and summarized for the other benchmarks. revision: yes
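For concreteness, the ablation grid promised in (a)-(c) might be enumerated as a small configuration table; the arm names and fields below are illustrative, not from the paper:

```python
# Hypothetical ablation arms; names and fields are illustrative only.
ABLATIONS = {
    "full_method":    dict(fission=True,  simulator="full",     extra_data=False),
    "no_fission":     dict(fission=False, simulator="full",     extra_data=False),  # (a)
    "weak_simulator": dict(fission=True,  simulator="25%_data", extra_data=False),  # (b)
    "data_matched":   dict(fission=False, simulator=None,       extra_data=True),   # (c)
    "grpo_baseline":  dict(fission=False, simulator=None,       extra_data=False),
}
```

Matching total rollout volume across arms (the data_matched row) is what would separate the fission mechanism's contribution from a pure data-budget effect.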
Circularity Check
No significant circularity; empirical claims rest on independent simulator and on-policy resampling
Full rationale
The paper describes an RL framework (Fission-GRPO) whose core step augments failed trajectories with feedback from a separately fine-tuned Error Simulator before on-policy resampling. This introduces a practical dependency on simulator alignment, but the reported gains (e.g., +5.7% recovery rate on BFCL v4) are empirical outcomes of benchmark evaluation, not quantities that reduce by construction to fitted inputs, self-citations, or renamed patterns. No equation ties the final accuracy figures to the simulator's parameters; the method remains falsifiable against external baselines and does not rest on uniqueness theorems or ansatzes from the authors' prior work as load-bearing premises. The evidential chain is therefore independently checkable against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Execution errors can be converted into useful on-policy training signals via augmentation with diagnostic feedback.
discussion (0)