OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
Pith reviewed 2026-05-08 16:21 UTC · model grok-4.3
The pith
An open recipe for training multimodal agents enables deep search and multi-step reasoning comparable to proprietary systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining a dedicated data pipeline for high-quality trajectories, a diverse set of perception and search tools, and a multi-turn fatal-aware GRPO algorithm, the OpenSearch-VL recipe trains agents that deliver over 10-point average improvements on seven benchmarks and reach results comparable to proprietary commercial models on several tasks.
What carries the argument
The multi-turn fatal-aware GRPO training algorithm, which masks post-failure tokens and applies one-sided advantage clamping to preserve useful pre-failure reasoning despite cascading tool failures.
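One possible reading of this mechanism is sketched below in Python. It is an illustration under stated assumptions, not the authors' implementation: the group-normalised advantage is the usual GRPO form, and the names `group_advantages`, `fatal_aware_weights`, `gen_mask`, `token_turn`, and `fatal_turn` are hypothetical. For a trajectory with a fatal tool failure, tokens at or after the failing turn receive zero weight, and the advantage is clamped from below at zero so the pre-failure prefix can be rewarded but never penalised.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages (standard GRPO form, assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fatal_aware_weights(advantage, gen_mask, token_turn, fatal_turn):
    """Per-token loss weights for one trajectory.

    gen_mask[t]   -- 1 if token t was generated by the model, 0 for tool output
    token_turn[t] -- index of the turn that token t belongs to
    fatal_turn    -- turn index of the first fatal failure, or None
    """
    if fatal_turn is None:
        # No fatal failure: ordinary masked advantage on generated tokens.
        return [advantage * m for m in gen_mask]
    # One-sided clamp: a failed trajectory never yields a negative signal.
    clamped = max(advantage, 0.0)
    # Mask everything from the fatal turn onwards; keep the viable prefix.
    return [clamped * m if turn < fatal_turn else 0.0
            for m, turn in zip(gen_mask, token_turn)]
```

With a positive advantage only the generated tokens before the fatal turn carry weight; with a negative advantage every weight is zero.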
Load-bearing premise
The substantial performance gains result directly from the data pipeline, tool environment, and fatal-aware GRPO algorithm rather than from unspecified details of model choice or training setup.
What would settle it
Reproducing the training with and without the fatal-aware components of the GRPO algorithm, then comparing average scores on the seven benchmarks to see if the improvement falls below ten points.
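The comparison described here can be made concrete with a small helper. This is a hedged sketch: the function names and the dictionary layout (benchmark name mapped to score) are assumptions, and no benchmark names or scores from the paper are used.

```python
def average_gain(scores_with, scores_without):
    """Mean per-benchmark score difference over the benchmarks both runs share."""
    shared = sorted(set(scores_with) & set(scores_without))
    if not shared:
        raise ValueError("the two runs share no benchmarks")
    total = sum(scores_with[name] - scores_without[name] for name in shared)
    return total / len(shared)


def gain_holds(scores_with, scores_without, threshold=10.0):
    """True if the average gain still exceeds the claimed threshold."""
    return average_gain(scores_with, scores_without) > threshold
```

Calling `gain_holds` on the scores of the full recipe and of a run without the fatal-aware components would indicate whether the improvement survives the ablation.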
Original abstract
Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenSearch-VL as a fully open-source recipe for training multimodal deep search agents via agentic reinforcement learning. It details a data construction pipeline involving Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding to create the SearchVL-SFT-36k and SearchVL-RL-8k datasets; a unified tool environment encompassing text/image search, OCR, cropping, sharpening, super-resolution, and perspective correction; and a multi-turn fatal-aware GRPO algorithm that masks post-failure tokens with one-sided advantage clamping. The authors report that this recipe yields over 10-point average improvements across seven benchmarks and achieves performance comparable to proprietary commercial models on several tasks, with plans to release all data, code, and models.
Significance. If the performance claims are validated through detailed experiments, this work would significantly advance the field by providing a transparent, reproducible framework for frontier multimodal search agents, addressing the current lack of open high-quality training data and pipelines. The commitment to open-sourcing the datasets, code, and models is a particular strength that could facilitate further research and community-driven improvements in agentic multimodal systems.
major comments (1)
- [Abstract] Abstract: the claim of 'over 10-point average improvements across seven benchmarks' and 'results comparable to proprietary commercial models' is presented without any benchmark names, baseline comparisons, ablation tables, error bars, or methodology details. This is load-bearing for the central claim that the Wikipedia-path sampling + fuzzy rewriting + visual grounding pipeline, the unified tool set, and the fatal-aware GRPO algorithm are the primary drivers of the gains, because without isolating controls (e.g., SFT/RL runs with each component removed while holding model size and tokens fixed) the improvements could stem from unmentioned factors such as base-model choice or data leakage.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive comment on the abstract. We agree that the abstract is high-level and will revise it to provide more context while preserving its brevity. The full paper contains the supporting details, tables, and ablations referenced below.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'over 10-point average improvements across seven benchmarks' and 'results comparable to proprietary commercial models' is presented without any benchmark names, baseline comparisons, ablation tables, error bars, or methodology details. This is load-bearing for the central claim that the Wikipedia-path sampling + fuzzy rewriting + visual grounding pipeline, the unified tool set, and the fatal-aware GRPO algorithm are the primary drivers of the gains, because without isolating controls (e.g., SFT/RL runs with each component removed while holding model size and tokens fixed) the improvements could stem from unmentioned factors such as base-model choice or data leakage.
Authors: We acknowledge that the abstract, as currently written, does not name the benchmarks or include quantitative details. In the revised manuscript we will update the abstract to explicitly list the seven benchmarks, name the primary baselines, and briefly note the scale of the gains. The full paper already reports per-benchmark results with baselines (Table 1), ablation studies that isolate each component of the data pipeline and the fatal-aware GRPO algorithm while controlling for model size and token budget (Section 4.3 and Appendix C), and standard error bars on all main results. These controls demonstrate that the reported gains are not explained by base-model choice or obvious leakage; we will add a short sentence in the abstract pointing readers to these sections. We believe this addresses the concern that the central claims rest on unverified factors.
revision: yes
Circularity Check
No circularity: paper is a descriptive engineering recipe with no derivations or self-referential reductions
Full rationale
The manuscript presents a data-curation pipeline (Wikipedia path sampling, fuzzy rewriting, visual grounding), a unified tool set, and a multi-turn fatal-aware GRPO algorithm, then reports empirical gains on benchmarks. No equations, fitted parameters, or first-principles derivations appear in the provided text. The central claims are attributions of performance to the described components, which are not shown to reduce to their own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are present. This is a standard non-circular engineering contribution whose validity rests on external replication rather than internal definitional equivalence.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
Reference graph
Works this paper leans on
-
[1]
If $\tilde{r}_i < 0$, then $g^{\mathrm{ours}}_i = g^{\mathrm{hard}}_i = 0$.
-
[2]
If $\tilde{r}_i \ge 0$, then $g^{\mathrm{hard}}_i = 0$, while $g^{\mathrm{ours}}_i$ coincides with the search-augmented GRPO gradient (Eq. 17) of $\tau_i$ restricted to its viable prefix $\{t : s(t) < f_i,\ M_{\mathrm{gen}}(y_{i,t}) = 1\}$, evaluated at the non-negative advantage $\tilde{r}_i$. Consequently, $g^{\mathrm{ours}}_i$ is weakly informative-dominant over $g^{\mathrm{hard}}_i$: it never propagates gradient through the post-fatal suffix, never penalises th...
-
[3]
any title containing the substring (disambiguation), or beginning with List of, Outline of, Index of, or Timeline of
-
[4]
all non-article namespaces, i.e. Template:, Category:, File:, User:, Help:, Portal:, Wikipedia:
-
[5]
managed by
redirect pages: each outgoing link is first dereferenced to its target article, and the exclusion rules above are applied to the dereferenced target rather than the surface link. Seeds are drawn by stratified sampling across five coarse domains, namely Person, Building/Place, Location (non-hub), Organism, and Artifact, to balance the representation of visually ground...
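The three exclusion rules quoted in the entries above (title patterns, non-article namespaces, and redirects dereferenced before filtering) can be sketched as a small link filter. This is a minimal illustration, not the authors' pipeline: the function names and the `redirects` dictionary are hypothetical, the trailing space after each title prefix is an assumption, and redirect chains are assumed not to occur.

```python
EXCLUDED_TITLE_PREFIXES = ("List of ", "Outline of ", "Index of ", "Timeline of ")
NON_ARTICLE_NAMESPACES = (
    "Template:", "Category:", "File:", "User:", "Help:", "Portal:", "Wikipedia:",
)


def is_excluded(title):
    """Apply the title and namespace exclusion rules to an article title."""
    if "(disambiguation)" in title:
        return True
    if title.startswith(EXCLUDED_TITLE_PREFIXES):
        return True
    return title.startswith(NON_ARTICLE_NAMESPACES)


def viable_links(links, redirects):
    """Dereference each surface link, then filter on the dereferenced target.

    redirects maps a redirect title to its target article title
    (hypothetical structure; a single hop is assumed).
    """
    kept = []
    for link in links:
        target = redirects.get(link, link)
        if not is_excluded(target):
            kept.append(target)
    return kept
```

Filtering on the dereferenced target rather than the surface link means a redirect that points at an excluded page is dropped, while a redirect to an ordinary article is kept under the article's own title.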
-
[11]
Answer with “yes” if the report contains the correct answer, “no” if it doesn’t or contradicts it.
Output format:
correct: [yes/no]
reasoning: [your explanation]
Figure 8 | The GPT-4o judge prompt used to compute the accuracy reward $r_{\mathrm{acc}} \in \{0,1\}$ during RL training.
Benchmark Evaluation Judge Prompt (GPT-4o): You are an impartial judge evaluating whether a deep ...
-
[12]
Read through the entire research report carefully
-
[13]
Look for the correct answer anywhere in the report (it may be embedded in paragraphs, tables, or sections)
-
[14]
Check if the information in the report is consistent with the correct answer
-
[15]
The answer does NOT need to be in a specific format or labelled as “final answer”
-
[16]
Provide your reasoning
-
[17]
Answer with “yes” if the report contains the correct answer, “no” if it doesn’t or contradicts it.
Output format:
correct: [yes/no]
reasoning: [your explanation]
Figure 10 | GPT-4o judge prompt used for benchmark evaluation of OpenSearch-VL and all baselines. We deliberately keep this prompt aligned with the evaluation protocol released by Vision-DeepResearch (Huang et...
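The judge prompts above ask for a `correct: [yes/no]` line followed by a `reasoning:` line, and the Figure 8 caption describes a binary accuracy reward. A hedged sketch of turning such a verdict into a 0/1 value follows. The function name is hypothetical, and treating an unparseable verdict as incorrect is an assumption rather than something the source states.

```python
import re

# Matches a line such as "correct: yes" or "correct: [no]", case-insensitively.
VERDICT_RE = re.compile(r"^\s*correct:\s*\[?(yes|no)\]?", re.IGNORECASE | re.MULTILINE)


def accuracy_reward(judge_output):
    """Map the judge's verdict line to a binary reward."""
    match = VERDICT_RE.search(judge_output)
    if match is None:
        return 0  # assumption: a missing or malformed verdict earns no reward
    return 1 if match.group(1).lower() == "yes" else 0
```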
-
[18]
Image search utility: Did image searches retrieve visual evidence that genuinely supports answering the question? Were the images relevant, or just noise?
2. Text search utility: Did text searches find relevant textual information? Were queries well-formed and targeted?
-
[19]
Query progression: Did the queries show logical progression (refining, narrowing, or covering different aspects)? Or did they repeat / drift aimlessly?
-
[20]
Verify, Don’t Guess
Complementarity: Did image and text searches complement each other, providing evidence that one modality alone couldn’t supply?
5. Evidence vs. noise ratio: What fraction of retrieved results actually contained useful evidence versus irrelevant content?
Score the overall query utility from 0.0 to 1.0:
- 0.0: No useful information retrieved; all searches ir...
-
[21]
Tool-First Mindset: small text ⇒ crop; blurry ⇒ sharpen; tilted ⇒ perspective_correct. Never rely on the internal encoder when a tool gives a sharper view.
-
[22]
Chain Your Tools: non-trivial queries usually require a pipeline, e.g. perspective_correct → crop → layout_parsing.
3. External Validation: whenever the answer depends on facts not purely visible in the pixels, you must call text_search.
-
[23]
Tool Calling Format: Function signatures are provided inside <tools>...</tools>. Emit tool calls as one JSON object inside <tool_call>{"name": <f>, "arguments": <args>}</tool_call>, one call per turn.
-
[24]
Your Toolbox
{Tool List}
For each of the seven tools the production prompt specifies its trigger (when to call), params (JSON schema), and output (how to consume the return value). The tools fall into three families: visual perception (crop, layout_parsing); image enhancement (perspective_correct, super_resolution, sharpen); and knowledge retrieval (text_search, image_search; imag...
-
[25]
Thinking Protocol
Before any action, emit a <think> block with: (i) analyse request; (ii) assess image quality: legibility, geometry, and target size, mapping any deficiency to the corresponding enhancement tool; (iii) identify information gaps: retrieval needed?; (iv) formulate plan: commit to a single next action. Critical reminders: (a) when layout_parsing retur...
-
[26]
Workflow Recipes
• Unreadable document: perspective_correct → sharpen → layout_parsing.
• Dense chart: crop (region of interest) → layout_parsing.
• Entity identification: image_search → text_search (mandatory follow-up).
-
[27]
Output Rules
1. Single action per turn; wait for its result before the next.
2. Think first: never emit <tool_call> without a preceding <think>.
3. Image refs: initial image is img_1; each tool output yields img_2, img_3, ...; always operate on the latest best version.
4. Final answer: emit <response>...</response> once evidence suffices.
-
[28]
Execution Example (tool-use turn)
<think> Invoice img_1 is skewed; correct perspective first. </think>
<tool_call>{"name": "perspective_correct", "arguments": {"image": "img_1"}}</tool_call>
Figure 7 | Condensed agent system prompt used during both inference and SFT trajectory collection. The placeholder {Tool List} stands in for the per-tool description bl...
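The tool-calling format and output rules quoted in the entries above (one JSON object inside `<tool_call>` tags, a single call per turn, and a `<think>` block before any call) lend themselves to a simple validity check on a model turn. The sketch below is an assumption-laden illustration: `parse_turn` and its error messages are hypothetical, and the actual environment may validate turns differently.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_turn(text):
    """Validate one tool-use turn and return (tool_name, arguments)."""
    calls = TOOL_CALL_RE.findall(text)
    if len(calls) != 1:
        raise ValueError("expected exactly one tool call per turn")
    think = THINK_RE.search(text)
    if think is None or think.end() > text.index("<tool_call>"):
        raise ValueError("a <think> block must precede the tool call")
    payload = json.loads(calls[0])
    if not isinstance(payload, dict) or set(payload) != {"name", "arguments"}:
        raise ValueError("tool call must be an object with 'name' and 'arguments'")
    return payload["name"], payload["arguments"]
```

Applied to the execution example from the system prompt, `parse_turn` returns the tool name `perspective_correct` and the arguments `{"image": "img_1"}`.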