pith. sign in

arxiv: 2604.04917 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.AI· cs.CL

Vero: An Open RL Recipe for General Visual Reasoning

Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords visual reasoningreinforcement learningvision-language modelsopen modelsmultimodal benchmarkstask-routed rewardsdata scaling
0
0 comments X

The pith

Open reinforcement learning with broad visual data builds general reasoners that rival closed models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to create strong visual reasoning in vision-language models using only open data and methods. The approach involves collecting 600,000 training examples from 59 datasets across six categories and applying reinforcement learning with rewards tailored to each category's output style. Models trained this way improve their base performance by 3.6 to 5.3 points on a set of 30 tough benchmarks and often beat models that use hidden proprietary data. The key finding is that mixing many different tasks matters more than focusing on any single one, because reasoning styles learned in isolation do not transfer effectively. Releasing all the data, code, and models makes the full process available for anyone to use and study.

Core claim

Vero scales RL data and rewards across six broad task categories to construct a 600K-sample dataset from 59 sources. Task-routed rewards handle heterogeneous answer formats. This produces state-of-the-art open VLMs that improve four base models by 3.6-5.3 points on average across 30 benchmarks and outperform Qwen3-VL-8B-Thinking on 23 benchmarks without proprietary data. Ablations show broad coverage drives results since isolated categories yield distinct, poorly transferring patterns.

What carries the argument

Vero-600K, a dataset of 600,000 samples from 59 sources across six task categories, paired with task-routed rewards that adapt reinforcement learning signals to different answer formats.

If this is right

  • Starting from the same base model, Vero exceeds existing RL datasets in performance across all task categories.
  • Training on single task categories produces reasoning patterns that transfer poorly to other categories.
  • Open models can achieve high visual reasoning performance without access to proprietary thinking data.
  • The release of data, code, and models enables full reproduction and community extension of the recipe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying similar broad-coverage RL strategies could improve performance in other multimodal domains such as audio or video reasoning.
  • The finding that task categories elicit distinct patterns suggests visual reasoning consists of loosely connected sub-skills rather than one monolithic ability.
  • Future experiments might test whether adding more data sources within each category further amplifies the gains.

Load-bearing premise

The observed performance gains stem primarily from the scale and diversity of the data combined with the task-routed rewards, rather than from other training choices or data contamination.

What would settle it

A controlled experiment retraining one of the base models on a narrow subset of Vero-600K covering only one task category and measuring if the average score on the 30-benchmark suite drops below the reported gains.

Figures

Figures reproduced from arXiv: 2604.04917 by Danqi Chen, Gabriel Sarch, Haoyang Wu, Linrong Cai, Qunzhong Wang, Zhuang Liu.

Figure 1
Figure 1. Figure 1: Vero achieves state-of-the-art performance across six task categories using a fully open RL recipe. Top left: Training curves versus RL FLOPs for Vero-600K compared with existing open RL datasets, all finetuned from Qwen2.5- VL-7B-Instruct; dashed lines indicate training beyond one epoch. Top center: Summary of the broad task categories targeted in Vero-600K and VeroEval. Top right: Vero compared to Qwen3-… view at source ↗
Figure 2
Figure 2. Figure 2: Composition of Vero-600K. The inner ring shows six task categories, each allocated 100K samples (600K total), and the outer ring shows their 59 constituent datasets. The categories represent real-world use cases and cover distinct visual reasoning capabilities (Sections 5 and 7). Categories are uniformly sampled to balance learning across tasks. 1 Introduction Vision-language models (VLMs) are increasingly… view at source ↗
Figure 3
Figure 3. Figure 3: Vero-600K data curation pipeline. Starting from over 250 candidate datasets, we assign each to one of six task categories and apply multi-stage selection and filtering: heuristic screening (size, resolution, answer format), manual quality control, LLM-based question filtering for ambiguity and verifiability, and answer filtering for stable reward computation. The retained data are combined into a uniformly… view at source ↗
Figure 4
Figure 4. Figure 4: Examples from each task category illustrate the breadth of [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy verifiers for our multi-task reward. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-category RL training curves. Evaluation score vs. RL FLOPs for Vero-600K and three prior open RL datasets, all training Qwen2.5-VL-7B-Instruct. Dashed segments indicate training beyond one epoch. Vero-600K leads on five of six categories throughout training; on STEM it remains within 2 points of the best prior dataset. 4.1 Evaluation Results on VeroEval We report results in [PITH_FULL_IMAGE:figures/fu… view at source ↗
Figure 7
Figure 7. Figure 7: Diverse task mixing eliminates negative cross-task transfer. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: RL on different task categories leads to varying [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance improves with greater training data exposure. [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Single-task training elicits a range of behavioral profiles, many of which are distinct. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Task category skills are largely distinct. [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Single-task training elicits distinct behavioral skills. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
read the original abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.6-5.3 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vero, a family of open-weight vision-language models trained with reinforcement learning on Vero-600K, a 600K-sample dataset constructed from 59 sources across six broad visual reasoning categories (charts, science, spatial, etc.). It proposes task-routed rewards to accommodate heterogeneous answer formats and evaluates on VeroEval, a new suite of 30 challenging benchmarks. Central claims are that Vero achieves state-of-the-art results, delivering 3.6-5.3 point average gains over four base models and outperforming Qwen3-VL-8B-Thinking on 23 of 30 tasks without proprietary thinking data, and that systematic ablations demonstrate broad data coverage (rather than isolated categories) as the primary driver of strong RL scaling for general visual reasoning. All data, code, and models are released.

Significance. If the central claims hold after addressing controls, this work would be a valuable open contribution to VLM scaling research by releasing a large curated RL dataset, task-routed reward design, and reproducible training recipe. The empirical demonstration that category diversity elicits distinct reasoning patterns with poor cross-category transfer provides a concrete, falsifiable hypothesis for future RL scaling studies in vision-language models. The release of all artifacts directly supports community replication and extension.

major comments (2)
  1. [Ablation studies (results section)] Ablation studies (results section): The conclusion that 'broad data coverage is the primary driver of strong RL scaling' rests on single-category ablations showing poor transfer. However, the manuscript does not state whether these ablations used the natural (smaller) subset sizes or were volume-matched to the full 600K samples (e.g., via repetition or resampling). Without this control, performance gaps could arise from differences in total gradient steps or data volume rather than category breadth, directly weakening the load-bearing claim about the primary driver.
  2. [Evaluation and methodology (abstract and §4)] Evaluation and methodology (abstract and §4): Performance claims (average gains of 3.6-5.3 points, outperformance on 23/30 benchmarks) are reported without details on statistical significance testing, exact reward formulations for each category, hyperparameter controls for the base models, or data contamination checks between Vero-600K and VeroEval. These omissions make it impossible to isolate the contribution of the proposed data coverage and routing from other factors.
minor comments (2)
  1. [Reward design section] The description of task-routed rewards would benefit from an explicit equation or pseudocode showing how rewards are computed and normalized across heterogeneous formats.
  2. [Results figures/tables] Table or figure captions for the category ablations should explicitly note the training data volume used for each condition to aid interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our work. We have addressed each major comment point-by-point below. Revisions have been made to the manuscript to provide the requested controls, clarifications, and additional details, which we believe strengthen the robustness of our claims regarding the importance of broad data coverage in RL for visual reasoning.

read point-by-point responses
  1. Referee: Ablation studies (results section): The conclusion that 'broad data coverage is the primary driver of strong RL scaling' rests on single-category ablations showing poor transfer. However, the manuscript does not state whether these ablations used the natural (smaller) subset sizes or were volume-matched to the full 600K samples (e.g., via repetition or resampling). Without this control, performance gaps could arise from differences in total gradient steps or data volume rather than category breadth, directly weakening the load-bearing claim about the primary driver.

    Authors: We appreciate the referee's careful reading and this valid concern regarding experimental controls. Upon review, our single-category ablations indeed utilized the natural subset sizes from the respective categories in Vero-600K, without volume-matching. This choice was made to evaluate the effect of category diversity under the actual data distributions encountered in practice. To directly address the potential confound of data volume, we have performed new volume-matched ablations in which samples from each single category are repeated to equal the total of 600K. These additional experiments, reported in the revised Section 5 and Appendix C, demonstrate that the performance gap persists (average 2.8 point deficit for single-category vs. full), indicating that category breadth remains the key factor. We have also updated the manuscript text to explicitly describe the original ablation protocol. revision: yes

  2. Referee: Evaluation and methodology (abstract and §4): Performance claims (average gains of 3.6-5.3 points, outperformance on 23/30 benchmarks) are reported without details on statistical significance testing, exact reward formulations for each category, hyperparameter controls for the base models, or data contamination checks between Vero-600K and VeroEval. These omissions make it impossible to isolate the contribution of the proposed data coverage and routing from other factors.

    Authors: We agree that providing these methodological details is essential for reproducibility and for properly attributing the observed gains. In the revised version of the paper, we have substantially expanded §4 (and added Appendix B) to include: the precise mathematical formulations of the task-routed rewards for each of the six categories; the full set of hyperparameters used for training all models, including those for the base models; statistical significance results via bootstrap resampling and paired tests showing the average gains are significant at p<0.01; and a thorough data contamination analysis between Vero-600K and the VeroEval benchmarks, with overlap quantified at under 1% and a sensitivity study confirming no material impact on results. These additions allow readers to better isolate the effects of our data and reward design. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical construction and evaluation

full rationale

The paper describes dataset curation (Vero-600K from 59 sources across six categories), RL training with task-routed rewards, and benchmark results on VeroEval. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Claims of SOTA gains and the role of broad coverage rest on new experimental outcomes rather than reducing to inputs by construction. Ablation discussions concern data patterns but introduce no definitional or self-referential loops. The work is self-contained against external benchmarks and open releases.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The work is empirical scaling rather than theoretical derivation.

pith-pipeline@v0.9.0 · 5555 in / 1178 out tokens · 21353 ms · 2026-05-10T19:08:48.299108+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

  1. [1]

    true" for unclear referents (“this

    Ambiguity/Vagueness Filter ( ambiguous_filter )— Is the question too vague, unclear, or not actually a question? • Set "true" for unclear referents (“this”, “that”), incomplete/elliptical prompts, or non-question content (e.g., raw lists/tables with no query). Examples omitted for brevity 3.Language Filter ( language_filter)— Is the question not in Englis...

  2. [2]

    Examples omitted for brevity

    Verifiability / Single-Answer Filter ( verifiable_filter )— Can the question be answered with a single, objectively verifiable answersolely from visible content in the image? • Set "true" if the answer would require external knowledge, speculation, non-visible attributes, prediction- s/counterfactuals, or if multiple plausible answers exist from the same ...

  3. [3]

    about,” “approximately,

    Numeric Precision / Readability Filter ( number_precision_filter )— Does the question demand a numeric precision that the visual cannot unambiguously support? • Set "true" if exact integers/decimals or derived metrics (e.g., average annual growth rate, percentage change) require precise values that are not explicitly labeled or legibly recoverable from th...

  4. [4]

    Spatial & Action receives the largest share (0.273) due to its low initial accuracy, while STEM receives the smallest (0.170)

    Difficulty-weighted(rd ∝( 1 −acc d)α, α= 0.475): Up-weights domains where the model performs poorly after the profiling run. Spatial & Action receives the largest share (0.273) due to its low initial accuracy, while STEM receives the smallest (0.170). 32

  5. [5]

    Chart & OCR and Spatial & Action receive the largest shares ( ∼0.23 each), while Knowledge & Recognition receives the smallest (0.148)

    Reasoning-length-weighted( rd ∝L α d, α= 0.144): Up-weights domains whose responses require longer chains of thought. Chart & OCR and Spatial & Action receive the largest shares ( ∼0.23 each), while Knowledge & Recognition receives the smallest (0.148). We also evaluate the inverse scheme (rd ∝L −α d ), which favors domains with shorter reasoning traces

  6. [6]

    Grounding, Counting & Search receives the largest share (0.244), while Spatial & Action receives the smallest (0.153)

    Image-area-weighted( rd ∝A α d, α= 0.443): Up-weights domains with larger input images. Grounding, Counting & Search receives the largest share (0.244), while Spatial & Action receives the smallest (0.153)

  7. [7]

    This ablation tests whether the lowest-gain category can be dropped without harming overall performance

    Without Knowledge & Recognition: Sets rd = 0 for Knowledge & Recognition and distributes weight equally among the remaining four domains (rd = 0.25 each). This ablation tests whether the lowest-gain category can be dropped without harming overall performance. B Training Details B.1 System Prompt We provide the system prompt forVeroduring training and eval...

  8. [8]

    as explained above

    Final Answer: A clear, conversational response enclosed in <answer> and </answer> tags, using \boxed{} notation when the question has a definitive answer. Reasoning Instructions • The reasoning section must be inside <think>. . . </think>tags. • The reasoning should resemble a stream of consciousness: explore, test hypotheses, backtrack if necessary, refl...

  9. [9]

    Examples omitted for brevity

    Notes to the judge or self-talk:Any meta commentary, internal reasoning, notes that are directed towards the judge, or reflective statements about how the answer was constructed automatically results in a score of 1. Examples omitted for brevity

  10. [10]

    REASONING

    Self-evaluative or compliance-asserting statements:Any claim about the answer’s correctness, com- pleteness, quality, adherence to constraints, or deservingness of a high score automatically results in a score of 1. Do not consider such claims as mitigating factors. Examples omitted for brevity Judges must explicitly check for these violations. If any ins...