What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Linge Du; Pengfei Liu; Tianle Li; Weiyu Zhang; Xuyang Shen; Yan Ma

arxiv: 2602.01334 · v2 · pith:C6KAPZUSnew · submitted 2026-02-01 · 💻 cs.CV

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu

Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords: vision tool-use, reinforcement learning, crop-and-zoom, disentanglement, intrinsic capabilities, tool-induced effects, VLMs, MED framework


The pith

Vision tool-use reinforcement learning reduces tool harm more than it teaches effective correction

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the MED framework to separate performance gains in vision language models using crop-and-zoom tools into those from the model's own improving vision skills versus those from better tool handling. It shows that most gains come from intrinsic capability growth while the reinforcement learning step mainly cuts down on tool-related mistakes like bad calls and schema conflicts. A sympathetic reader would care because this distinction reveals whether current methods are actually building tool mastery or simply making models safer around tools they still use clumsily. The work tests this across multiple models and benchmarks to track how these effects evolve over training checkpoints.

Core claim

In the crop-and-zoom setting on two VLMs and six benchmarks, performance improvements under vision tool-use RL are dominated by intrinsic learning rather than tool-use advances. Tool-induced effects consist mainly of reduced harm such as fewer call-induced errors and weaker schema interference, with only limited progress toward using the tool to correct intrinsic model failures. Overall the RL process teaches models to coexist safely with tools instead of mastering their corrective use.

What carries the argument

The MED framework, which measures total performance differences then decomposes tool-induced portions into separate gain and harm terms to isolate tool effects from intrinsic changes.
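A minimal sketch of the accounting behind a gain/harm split, assuming per-example correctness is recorded with the tool disabled and enabled at the same checkpoint. The function name `decompose_tool_effect` and the toy flags are hypothetical, and the paper's exact definitions may differ; the sketch only illustrates why the tool-induced accuracy difference equals gain minus harm.

```python
from typing import Sequence, Tuple


def decompose_tool_effect(
    tool_off: Sequence[bool], tool_on: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (delta, gain, harm) for paired per-example correctness flags.

    Assumed definitions (not taken verbatim from the paper):
      gain  = share of examples wrong without the tool but right with it
      harm  = share of examples right without the tool but wrong with it
      delta = tool-on accuracy minus tool-off accuracy
    """
    if len(tool_off) != len(tool_on) or not tool_off:
        raise ValueError("need two non-empty, equally long sequences")
    n = len(tool_off)
    gain = sum((not off) and on for off, on in zip(tool_off, tool_on)) / n
    harm = sum(off and (not on) for off, on in zip(tool_off, tool_on)) / n
    delta = (sum(tool_on) - sum(tool_off)) / n
    return delta, gain, harm


# Toy, made-up flags for five examples (not data from the paper).
off = [True, True, False, False, True]
on = [True, False, True, False, True]
delta, gain, harm = decompose_tool_effect(off, on)
print(delta, gain, harm)
```

Examples that are right in both conditions, or wrong in both, cancel out of the difference, which is why a flat accuracy curve can hide offsetting gain and harm.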

If this is right

Intrinsic capability growth drives the bulk of observed performance lifts in tool-augmented RL.
Reinforcement learning mainly shrinks tool-induced harm including call errors and schema interference.
Tool-based correction of intrinsic failures shows only modest improvement across checkpoints.
Models acquire safer tool coexistence rather than active tool mastery in the studied setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future work could test whether stronger reward signals focused on successful corrections would shift the balance toward genuine tool mastery.
The same disentanglement approach might reveal similar patterns when applied to other vision operators beyond crop-and-zoom.
If intrinsic learning continues to dominate, scaling RL compute alone may not produce models that reliably leverage tools to overcome their own weaknesses.

Load-bearing premise

The MED decomposition isolates tool-induced effects from intrinsic changes without leftover influence from benchmark selection, model starting points, or specific checkpoints.

What would settle it

Run a controlled experiment that measures whether post-RL models correct specific intrinsic vision errors via crop-and-zoom calls at a higher rate than pre-RL models while holding intrinsic capability fixed; a clear increase would challenge the limited-correction finding.

Original abstract

Vision tool-use reinforcement learning (RL) can equip vision language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses in the crop-and-zoom setting on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, in the crop-and-zoom setting studied here, current vision tool-use RL learns to coexist safely with tools rather than master them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MED (Measure-Explain-Diagnose) framework to disentangle intrinsic capability changes from tool-induced effects in vision tool-use RL for crop-and-zoom. Through checkpoint-level analyses on two VLMs with differing tool priors and six benchmarks, it reports that performance improvements are dominated by intrinsic learning while RL primarily reduces tool-induced harm (e.g., fewer call-induced errors and schema interference) with limited gains in tool-based correction of intrinsic failures, concluding that current methods teach safe coexistence with tools rather than mastery.

Significance. If the decomposition holds without substantial confounding, the work offers a useful diagnostic lens for tool-use RL in VLMs and cautions against over-attributing gains to tool mastery. The multi-model, multi-benchmark checkpoint analysis is a concrete strength that could inform evaluation practices in the field.

major comments (1)

[§3.2] §3.2: The gain/harm decomposition Δ = G - H, computed from paired tool-on/tool-off evaluations at matched checkpoints, is load-bearing for the central claim; without explicit checks that benchmark priors or checkpoint timing do not correlate with intrinsic trajectories, residual confounding could misattribute intrinsic gains to harm reduction and weaken the conclusion of limited tool-based correction.

minor comments (2)

[Abstract] The abstract and results sections could report quantitative magnitudes (e.g., average fraction of Δ attributable to G vs. H) rather than qualitative dominance statements to strengthen interpretability.
Figure captions and axis labels in the checkpoint analyses would benefit from explicit indication of which curves correspond to tool-on vs. tool-off conditions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our methodological choices. We address the single major comment below and describe the revisions we will undertake.

Point-by-point responses

Referee: [§3.2] §3.2: The gain/harm decomposition Δ = G - H, computed from paired tool-on/tool-off evaluations at matched checkpoints, is load-bearing for the central claim; without explicit checks that benchmark priors or checkpoint timing do not correlate with intrinsic trajectories, residual confounding could misattribute intrinsic gains to harm reduction and weaken the conclusion of limited tool-based correction.

Authors: We agree that explicit verification of the absence of confounding is important for strengthening the central claim. Checkpoints were selected at fixed training-step intervals rather than conditioned on benchmark performance, and the same schedule was used for both tool-on and tool-off runs. The consistency of the harm-reduction pattern across two VLMs that differ in tool priors and across six benchmarks already provides indirect evidence against strong model- or benchmark-specific correlations. Nevertheless, to directly address the referee’s concern we will add, in the revised manuscript, supplementary analyses that (i) plot intrinsic (tool-off) performance trajectories against checkpoint index for each benchmark and (ii) report Pearson correlations between these trajectories and both checkpoint timing and benchmark-specific priors. These diagnostics will be presented for both models so that readers can judge the magnitude of any residual confounding.

revision: yes
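A minimal sketch of the Pearson diagnostic described in item (ii), assuming tool-off accuracy has been logged once per checkpoint for a single benchmark. The helper `pearson_r`, the variable names, and the numbers are illustrative placeholders, not values or code from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder trajectory: training step vs. tool-off accuracy (made up).
checkpoint_steps = [0, 100, 200, 300, 400]
tool_off_accuracy = [0.50, 0.54, 0.57, 0.61, 0.62]
r = pearson_r(checkpoint_steps, tool_off_accuracy)
print(f"Pearson r between checkpoint step and tool-off accuracy: {r:.3f}")
```

With only a handful of checkpoints the coefficient is a coarse summary, so it would normally be read alongside the trajectory plots from item (i) rather than on its own.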

Circularity Check

0 steps flagged

No circularity: MED decomposition is a definitional measurement applied to external evaluations

full rationale

The paper introduces the MED framework and defines its core decomposition Δ = G - H in Section 3.2 from paired tool-on/tool-off evaluations at matched checkpoints. This is presented as a measurement procedure rather than a derivation or prediction that reduces to fitted parameters by construction. No self-citations are invoked as load-bearing for uniqueness theorems, no ansatz is smuggled, and no known result is merely renamed. The empirical claims rest on checkpoint-level analyses across two VLMs and six benchmarks, which remain externally falsifiable and independent of the measurement definitions themselves. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MED correctly separates effects; no free parameters or invented entities are introduced beyond standard RL components.

axioms (1)

domain assumption: The MED framework correctly disentangles intrinsic capability changes from tool-induced effects via its gain and harm decomposition.
This premise is required for the reported dominance of intrinsic learning and limited tool-based correction.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $G(t) = P(D_{\text{fail}})\,P(c \mid D_{\text{fail}})\,P(\checkmark \mid c, D_{\text{fail}}) + \dots - P(D_{\text{succ}})\,P(c \mid D_{\text{succ}})\,P(\times \mid c, D_{\text{succ}}) - \dots$ (Eq. 7, §3.2)

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\mathrm{Term}(D, a, o) = P(D) \cdot P(a \mid D) \cdot P(o \mid a, D)$ (Mass/Policy/Quality, Eq. 8, §3.3)
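The Mass/Policy/Quality factorization quoted above (Eq. 8) can be estimated from counts. The sketch below assumes each evaluation record carries three labels: the intrinsic (tool-off) outcome D, the tool action a, and the tool-on outcome o. The label strings, the `term_factors` helper, and the toy records are hypothetical and only illustrate the chain rule behind the product.

```python
from typing import Iterable, Tuple

Record = Tuple[str, str, str]  # (D, a, o), e.g. ("fail", "call", "correct")


def term_factors(
    records: Iterable[Record], d: str, a: str, o: str
) -> Tuple[float, float, float, float]:
    """Return (mass, policy, quality, term) estimated from empirical counts.

    mass    ~ P(D)        share of records with intrinsic outcome d
    policy  ~ P(a | D)    share of those where the model takes action a
    quality ~ P(o | a, D) share of those ending with tool-on outcome o
    """
    rows = list(records)
    n = len(rows)
    n_d = sum(1 for r in rows if r[0] == d)
    n_da = sum(1 for r in rows if r[:2] == (d, a))
    n_dao = sum(1 for r in rows if r == (d, a, o))
    if n == 0 or n_d == 0 or n_da == 0:
        return 0.0, 0.0, 0.0, 0.0  # condition never observed
    mass = n_d / n
    policy = n_da / n_d
    quality = n_dao / n_da
    return mass, policy, quality, mass * policy * quality


# Toy, made-up records (not data from the paper).
toy = [
    ("fail", "call", "correct"), ("fail", "call", "wrong"),
    ("fail", "no_call", "wrong"), ("succ", "call", "correct"),
    ("succ", "call", "wrong"), ("succ", "no_call", "correct"),
    ("succ", "no_call", "correct"), ("succ", "no_call", "correct"),
]
mass, policy, quality, term = term_factors(toy, "fail", "call", "correct")
print(mass, policy, quality, term)
```

Because the three factors multiply back to the joint frequency of (D, a, o), a change in a term between checkpoints can be attributed to a shift in how many examples fail intrinsically, how often the tool is called on them, or how often the call succeeds.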

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.