arxiv: 2605.05185 · v1 · submitted 2026-05-06 · 💻 cs.CV


OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang


Pith reviewed 2026-05-08 16:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal search agents · agentic reinforcement learning · deep search · open-source training recipe · GRPO algorithm · visual grounding · tool environment · multimodal reasoning


The pith

An open recipe for training multimodal agents enables deep search and multi-step reasoning comparable to proprietary systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a fully open-source method to train frontier-level multimodal deep search agents using agentic reinforcement learning. It details a data curation pipeline that generates high-quality training examples by sampling Wikipedia paths and applying visual grounding to avoid shortcuts. A unified tool environment allows agents to search text and images while performing operations like OCR and image correction. A specialized training algorithm manages failures across multiple tool-use turns. This matters because it provides the missing pieces—data, code, and models—for the community to build and study advanced visual-language agents that were previously hard to reproduce.
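The path-sampling step of the data pipeline can be pictured as a filtered random walk over page links. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the toy link graph, the page titles, the exclusion filter, and the names `is_excluded` and `sample_path` are all hypothetical and chosen for illustration.

```python
import random

# Hypothetical toy link graph standing in for Wikipedia (titles are illustrative).
TOY_LINKS = {
    "Eiffel Tower": ["Gustave Eiffel", "Paris", "List of towers"],
    "Gustave Eiffel": ["Garabit viaduct", "Eiffel Tower"],
    "Garabit viaduct": ["Truyere", "Gustave Eiffel"],
    "Paris": ["Eiffel Tower"],
    "Truyere": [],
}


def is_excluded(title):
    """Assumed filter: skip list-style and disambiguation pages."""
    return title.startswith("List of ") or "(disambiguation)" in title


def sample_path(links, seed_title, hops, rng):
    """Random walk of at most `hops` links that never revisits a page."""
    path = [seed_title]
    for _ in range(hops):
        candidates = [
            nxt
            for nxt in links.get(path[-1], [])
            if nxt not in path and not is_excluded(nxt)
        ]
        if not candidates:
            break  # dead end: return the shorter path
        path.append(rng.choice(candidates))
    return path


path = sample_path(TOY_LINKS, "Eiffel Tower", hops=3, rng=random.Random(0))
print(" -> ".join(path))
```

A multi-hop question would then be written so that answering it requires following every page on the path, which is one plausible reading of how path sampling discourages one-step retrieval.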

Core claim

By combining a dedicated data pipeline for high-quality trajectories, a diverse set of perception and search tools, and a multi-turn fatal-aware GRPO algorithm, the OpenSearch-VL recipe trains agents that deliver over 10-point average improvements on seven benchmarks and reach results comparable to proprietary commercial models on several tasks.

What carries the argument

The multi-turn fatal-aware GRPO training algorithm, which masks post-failure tokens and applies one-sided advantage clamping to preserve useful pre-failure reasoning despite cascading tool failures.
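One way to read this mechanism is as a per-token loss weight. The sketch below is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the clamp is assumed to be `max(advantage, 0)`, and every token from the failing turn onward is assumed to be masked.

```python
def group_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


def fatal_aware_weights(token_turns, advantage, fatal_turn=None):
    """Per-token loss weights for one trajectory.

    token_turns[i] is the turn index of generated token i.
    fatal_turn is the turn of an unrecoverable tool failure, or None.
    """
    if fatal_turn is None:
        # No failure: ordinary GRPO, every token carries the advantage.
        return [advantage] * len(token_turns)
    # Assumed one-sided clamp: a failed trajectory is never penalized,
    # but a positive advantage still rewards its pre-failure reasoning.
    clamped = max(advantage, 0.0)
    # Assumed mask: tokens at or after the failing turn get zero weight.
    return [clamped if turn < fatal_turn else 0.0 for turn in token_turns]


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(fatal_aware_weights([0, 0, 1, 1, 2], advantages[0], fatal_turn=1))
```

Under these assumptions a failed trajectory with a negative advantage contributes no gradient at all, while one with a positive advantage contributes only through its pre-failure tokens.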

Load-bearing premise

The substantial performance gains result directly from the data pipeline, tool environment, and fatal-aware GRPO algorithm rather than from unspecified details of model choice or training setup.

What would settle it

Reproducing the training with and without the fatal-aware components of the GRPO algorithm, then comparing average scores on the seven benchmarks to see if the improvement falls below ten points.
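The comparison itself is simple arithmetic, sketched below. The benchmark names and all scores are placeholders invented for illustration, not results from the paper; `average_gain` and `ablation_verdict` are hypothetical helper names.

```python
def average_gain(baseline_scores, model_scores):
    """Mean per-benchmark gain, in points, of a model over a baseline."""
    if baseline_scores.keys() != model_scores.keys():
        raise ValueError("both runs must cover the same benchmarks")
    gains = [model_scores[name] - baseline_scores[name] for name in baseline_scores]
    return sum(gains) / len(gains)


def ablation_verdict(baseline, full, ablated, threshold=10.0):
    """Compare the full recipe with an ablated run against one baseline."""
    ablated_gain = average_gain(baseline, ablated)
    return {
        "full_gain": average_gain(baseline, full),
        "ablated_gain": ablated_gain,
        "ablated_below_threshold": ablated_gain < threshold,
    }


# Placeholder scores for seven unnamed benchmarks (NOT from the paper).
names = [f"bench_{i}" for i in range(1, 8)]
BASE = {name: 40 for name in names}
FULL = {name: 52 for name in names}       # full recipe
ABLATED = {name: 47 for name in names}    # fatal-aware components removed

print(ablation_verdict(BASE, FULL, ABLATED))
```

A verdict with `ablated_below_threshold` set would suggest the fatal-aware components account for part of the headline gain, provided the base model and token budget are held fixed across runs.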

Original abstract

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OpenSearch-VL gives a concrete open recipe for multimodal search agents with data synthesis and a failure-aware RL tweak, but the big performance numbers sit on untested attributions.


The paper's core contribution is a complete, documented pipeline for building multimodal agents that actively search and reason. It starts with Wikipedia path sampling to create multi-step trajectories, adds fuzzy entity rewriting and visual grounding to cut shortcuts, then layers on a unified tool set covering text and image search, OCR, cropping, and image enhancement (sharpening, super-resolution, perspective correction). On top of that it introduces fatal-aware GRPO, which masks tokens after tool failures while keeping earlier reasoning via one-sided advantage clamping. The authors back this with two new datasets (SearchVL-SFT-36k for SFT, SearchVL-RL-8k for RL) and say they will release everything. That level of openness is the strongest part; anyone working on agentic multimodal systems now has a reproducible starting point instead of guessing at closed-model internals.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces OpenSearch-VL as a fully open-source recipe for training multimodal deep search agents via agentic reinforcement learning. It details a data construction pipeline involving Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding to create the SearchVL-SFT-36k and SearchVL-RL-8k datasets; a unified tool environment encompassing text/image search, OCR, cropping, sharpening, super-resolution, and perspective correction; and a multi-turn fatal-aware GRPO algorithm that masks post-failure tokens with one-sided advantage clamping. The authors report that this recipe yields over 10-point average improvements across seven benchmarks and achieves performance comparable to proprietary commercial models on several tasks, with plans to release all data, code, and models.
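A unified tool environment of this kind is often built as a single dispatcher that routes each call to a handler and reports failures as data, so the training loop can react to them. The sketch below assumes a JSON call shape with `name` and `arguments` fields; the stub tools, the `TOOLS` registry, and `dispatch` are hypothetical and do not reproduce the released code.

```python
import json


# Stub tools: each returns a string describing what a real tool would produce.
def text_search(query):
    return f"text results for {query!r}"


def crop(image, box):
    return f"{image} cropped to {box}"


# Hypothetical registry mapping tool names to handlers.
TOOLS = {"text_search": text_search, "crop": crop}


def dispatch(tool_call_json):
    """Route one JSON tool call to its handler; report failures instead of raising."""
    try:
        call = json.loads(tool_call_json)
        handler = TOOLS[call["name"]]
        return {"ok": True, "result": handler(**call["arguments"])}
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        # Malformed JSON, unknown tool, or wrong arguments all surface here.
        return {"ok": False, "error": f"{type(err).__name__}: {err}"}


print(dispatch('{"name": "text_search", "arguments": {"query": "Garabit viaduct"}}'))
print(dispatch('{"name": "ocr", "arguments": {}}'))
```

Returning a structured failure rather than raising lets the caller record at which turn a tool call broke, which is the kind of signal a failure-aware training objective would need.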

Significance. If the performance claims are validated through detailed experiments, this work would significantly advance the field by providing a transparent, reproducible framework for frontier multimodal search agents, addressing the current lack of open high-quality training data and pipelines. The commitment to open-sourcing the datasets, code, and models is a particular strength that could facilitate further research and community-driven improvements in agentic multimodal systems.

major comments (1)

[Abstract] The claim of 'over 10-point average improvements across seven benchmarks' and 'results comparable to proprietary commercial models' is presented without any benchmark names, baseline comparisons, ablation tables, error bars, or methodology details. This is load-bearing for the central claim that the Wikipedia-path sampling + fuzzy rewriting + visual grounding pipeline, the unified tool set, and the fatal-aware GRPO algorithm are the primary drivers of the gains. Without isolating controls (e.g., SFT/RL runs with each component removed while holding model size and tokens fixed), the improvements could stem from unmentioned factors such as base-model choice or data leakage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the constructive comment on the abstract. We agree that the abstract is high-level and will revise it to provide more context while preserving its brevity. The full paper contains the supporting details, tables, and ablations referenced below.


Referee: [Abstract] The claim of 'over 10-point average improvements across seven benchmarks' and 'results comparable to proprietary commercial models' is presented without any benchmark names, baseline comparisons, ablation tables, error bars, or methodology details. This is load-bearing for the central claim that the Wikipedia-path sampling + fuzzy rewriting + visual grounding pipeline, the unified tool set, and the fatal-aware GRPO algorithm are the primary drivers of the gains. Without isolating controls (e.g., SFT/RL runs with each component removed while holding model size and tokens fixed), the improvements could stem from unmentioned factors such as base-model choice or data leakage.

Authors: We acknowledge that the abstract, as currently written, does not name the benchmarks or include quantitative details. In the revised manuscript we will update the abstract to explicitly list the seven benchmarks, name the primary baselines, and briefly note the scale of the gains. The full paper already reports per-benchmark results with baselines (Table 1), ablation studies that isolate each component of the data pipeline and the fatal-aware GRPO algorithm while controlling for model size and token budget (Section 4.3 and Appendix C), and standard error bars on all main results. These controls demonstrate that the reported gains are not explained by base-model choice or obvious leakage; we will add a short sentence in the abstract pointing readers to these sections. We believe this addresses the concern that the central claims rest on unverified factors.

Revision: yes

Circularity Check

0 steps flagged

No circularity: paper is a descriptive engineering recipe with no derivations or self-referential reductions


The manuscript presents a data-curation pipeline (Wikipedia path sampling, fuzzy rewriting, visual grounding), a unified tool set, and a multi-turn fatal-aware GRPO algorithm, then reports empirical gains on benchmarks. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claims are attributions of performance to the described components, which are not shown to reduce to their own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are present. This is a standard non-circular engineering contribution whose validity rests on external replication rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper focused on data pipelines and training procedures rather than theoretical models. No free parameters, mathematical axioms, or new postulated entities are introduced or required by the abstract.

pith-pipeline@v0.9.0 · 5604 in / 1241 out tokens · 66270 ms · 2026-05-08T16:21:59.793351+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV · 2026-05 · unverdicted · novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL · 2026-05 · unverdicted · novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
