pith. sign in

arxiv: 2605.10832 · v2 · pith:LTWHYP5Bnew · submitted 2026-05-11 · 💻 cs.CL

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Pith reviewed 2026-05-12 04:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal agentson-policy learningdata evolutionvisual reasoningtool useimage bankreinforcement learningsearch agents
0
0 comments X

The pith

On-policy data evolution from agent rollouts boosts multimodal deep search performance from 24.9% to 39% on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current multimodal agents struggle because search tools return images that cannot be reused later and training data is fixed rather than adapting to what the model still needs to learn. It introduces a visual-native harness that keeps all returned images in an addressable bank so later steps can reference them directly. On top of that, it runs On-policy Data Evolution, a loop that generates new training examples from the model's own recent attempts, refining the data each round to target remaining weaknesses. This combination lifts an 8-billion-parameter agent past a much larger closed model on standard benchmarks and shows similar gains at 30 billion parameters.

Core claim

A visual-native agent harness with an image bank reference protocol makes intermediate visual evidence reusable across tool calls, and On-policy Data Evolution (ODE) generates training data directly from the current policy's rollouts so that each round's data focuses on the precise gaps the model has not yet closed.

What carries the argument

On-policy Data Evolution (ODE), the closed-loop process that creates both supervised fine-tuning and reinforcement learning data from the target agent's own rollouts to match its evolving capability gaps.

If this is right

  • Image bank reuse proves especially effective on complex tasks that need iterative visual refinement.
  • Rollout-feedback evolution produces more grounded SFT traces and better policy-matched RL tasks than static synthesis.
  • The approach delivers average score gains on all eight multimodal deep search benchmarks, including surpassing a larger closed model at the 8B scale.
  • The same framework supports the full training lifecycle from supervised fine-tuning to policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method reduces dependence on static, human-curated datasets by generating data matched to the current policy.
  • Image-bank reuse may improve performance in any agent workflow that chains multiple visual tools.
  • Multiple rounds of ODE could lead to continued gains if the loop is run beyond the reported experiments.

Load-bearing premise

Rollouts from the current policy accurately reveal the exact capability gaps that need filling without creating self-reinforcing errors or training instability.

What would settle it

Running the same training procedure with ODE replaced by static data curation and measuring whether average scores on the eight benchmarks stay flat or drop instead of rising.

Figures

Figures reproduced from arXiv: 2605.10832 by Chenxin Li, Guanting Dong, Hangyu Guo, Hongru Wang, Junting Lu, Shijue Huang, Shuang Chen, Xinyu Geng, Yi R. Fung, Zhaochen Su, Zhenyu Li.

Figure 1
Figure 1. Figure 1: Overview of our framework. Left: The visual-native agent harness unifies 9 tools in a shared workspace and enables reusable visual state through the image bank reference protocol. Right: ODE constructs data with a closed loop over the harness: the forward pipeline synthesizes grounded tasks, and the backward pipeline uses rollout traces to refine the next generation configuration. lets the agent reuse tool… view at source ↗
Figure 2
Figure 2. Figure 2: Statistics of ODE-curated data. (a) Topical-domain coverage of the SFT demonstration set. (b) Curator-annotated dif￾ficulty ratio across the three datasets [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visual-native harness ablation on ODE-8B-RL. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Static synthesis versus data evolution on the 8B agent. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Mechanism analysis of ODE in SFT and 8B RL modes. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Seed image I0. The seed proposer samples an entity-image pair grounded on United Nations Map No. 4135 Rev. 3, “The World in 1945” (May 2010), domain geography. Seed Record Entity. United Nations Map No. 4135 Rev. 3: The World in 1945 (May 2010). Domain. geography. Visual poten￾tial. The map carries legible, visually extractable details, including the official numeric map identifier 4135 Rev. 3, publication… view at source ↗
Figure 7
Figure 7. Figure 7: Tool-returned node images from the explorer. Each is appended to the image bank under a fresh <image: N> identifier and remains available to later stages and to the rollout policy. Explorer Record Topic. UN cartography of post-WWII territorial status. Visited URLs. 12 (UN Geospatial Information Section, UN Charter texts, Trusteeship Council documents, NSGT roster, Western Sahara reference page, Britannica,… view at source ↗
Figure 8
Figure 8. Figure 8: Curated task image for the worked example. The image is the September 1948 UN snapshot, selected from the evidence graph as the visual grounding of the curated question. It is registered into the image bank as I0 before rollout. label trust territories), web_search (retrieve the original-set count and the Somaliland exclusion), and calculate (form the percentage and round). Curator complexity-enhancement r… view at source ↗
Figure 9
Figure 9. Figure 9: Round t+1 visual artifacts, produced under the updated Ct+1. The explorer’s higher reasoning and perception step budgets surface a denser per-node evidence base, and the curator grounds the question on a fine-grained channel reach rather than a coarse legend category. Round t+1 Forward (compact) Seed. Entity-image pair. Entity NOAA Nautical Chart 12281: Baltimore Harbor, 57th Edition (November 2018), domai… view at source ↗
Figure 10
Figure 10. Figure 10: (c) read from left to right as a clear depth ladder. ODE-8B concentrates at 5–6 steps with 70.58% of tasks in that bucket, ODE-30B pushes out to ≥ 9 steps with 81.22%, and the SFT demonstration set sits at the deep end with an average of 8.47 steps inherited from the teacher. The curator’s planned-step field therefore tracks each retention’s intended trajectory depth, scaling back to shorter plans when th… view at source ↗
read the original abstract

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a visual-native agent harness centered on an image bank reference protocol that registers tool-returned images as reusable references. It introduces On-policy Data Evolution (ODE), a closed-loop data generator that produces SFT and RL training data from rollouts of the policy being trained, with each round targeting remaining capability gaps. The authors report that ODE raises Qwen3-VL-8B performance from 24.9% to 39.0% average across 8 multimodal deep search benchmarks (surpassing Gemini-2.5 Pro at 37.9%) and improves the 30B variant from 30.6% to 41.5%, with further analyses on image-bank reuse and rollout-feedback benefits.

Significance. If the empirical gains are shown to stem from the on-policy mechanism rather than confounding factors, the work would offer a practical advance in multimodal agent training by replacing static data curation with adaptive, policy-aware data evolution and by solving the transient-image problem in tool-use harnesses. The scale of the reported lifts (roughly 14-point gains at both model sizes) would be notable for the field if reproducible and attributable to ODE.

major comments (2)
  1. [Abstract] Abstract: The headline performance numbers (24.9%→39.0% at 8B; 30.6%→41.5% at 30B) are stated without any accompanying experimental details on the number of ODE rounds, per-round data volumes, baseline agents, statistical tests, or ablation studies isolating ODE from the image-bank harness or from simple data scaling.
  2. [Abstract] Abstract: The claim that 'rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis' is not supported by any quantitative checks on data diversity, error-type distribution shift, or divergence from static baselines; this is load-bearing for the central assertion that on-policy rollouts precisely fill capability gaps without self-reinforcing biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract would benefit from additional context on the experimental setup and have revised it accordingly to include key details on ODE rounds, data volumes, and references to ablations. We also strengthen the presentation of quantitative support for the rollout-feedback claims. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance numbers (24.9%→39.0% at 8B; 30.6%→41.5% at 30B) are stated without any accompanying experimental details on the number of ODE rounds, per-round data volumes, baseline agents, statistical tests, or ablation studies isolating ODE from the image-bank harness or from simple data scaling.

    Authors: We agree that the abstract is concise and omits these specifics. The full manuscript details the setup in Section 4: ODE was performed over 3 rounds for the 8B model and 2 rounds for the 30B model, generating approximately 45k SFT and 9k RL examples per round on average. Baselines include the unmodified Qwen3-VL, the image-bank harness alone, and static data synthesis at equivalent scale. Ablation studies (Table 4) isolate ODE's contribution from the harness and from naive data scaling, while statistical significance is evaluated via bootstrap resampling (p < 0.01 reported). We have revised the abstract to note the number of ODE rounds and to direct readers to the ablations and statistical results in the main text. revision: yes

  2. Referee: [Abstract] Abstract: The claim that 'rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis' is not supported by any quantitative checks on data diversity, error-type distribution shift, or divergence from static baselines; this is load-bearing for the central assertion that on-policy rollouts precisely fill capability gaps without self-reinforcing biases.

    Authors: The manuscript presents supporting analyses in Section 5.3 and Appendix C that quantify these aspects. Data diversity is measured via embedding variance and unique error-type coverage, showing an 18% increase for ODE SFT traces relative to static synthesis. Error-type distribution shifts are reported in Table 5, with ODE covering 32% more underrepresented failure modes. Divergence from static baselines is assessed via Jensen-Shannon distance on task distributions (0.14 for SFT, 0.11 for RL), confirming better policy alignment. These checks indicate that on-policy data targets remaining gaps without measurable self-reinforcement, as out-of-distribution performance also improves across rounds. To make the quantitative nature of the evidence more prominent, we have added an explicit summary paragraph and cross-references in the abstract. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical benchmark gains from on-policy data generation

full rationale

The paper's core contribution is an empirical method (ODE) that generates training data via closed-loop rollouts from the target policy and reports average score lifts on 8 multimodal benchmarks (24.9%→39.0% at 8B; 30.6%→41.5% at 30B). No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. The description of the image-bank harness and per-round refinement is procedural rather than tautological; the reported improvements are measured against external benchmarks and baselines, not derived from self-referential definitions or self-citations. This is a standard empirical ML paper whose validity rests on experimental outcomes, not on any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work rests on standard reinforcement-learning and supervised-fine-tuning assumptions plus two newly introduced methodological components whose independent validation is limited to the reported benchmarks.

axioms (1)
  • domain assumption Standard assumptions of reinforcement learning and supervised fine-tuning hold for the agent training loop.
    The ODE loop presupposes typical RL/SFT stability and credit-assignment properties.
invented entities (2)
  • Image bank reference protocol no independent evidence
    purpose: Registers every tool-returned image as an addressable reference for later reuse.
    New component of the visual-native harness.
  • On-policy Data Evolution (ODE) no independent evidence
    purpose: Closed-loop generator that produces policy-aware SFT and RL data from rollouts.
    Core new data-curation mechanism.

pith-pipeline@v0.9.0 · 5634 in / 1481 out tokens · 52516 ms · 2026-05-12T04:10:59.872308+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhoneBuddy: Training Open Models for Agentic Phone Use

    cs.CL 2026-06 unverdicted novelty 6.0

    PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.