Vero: An Open RL Recipe for General Visual Reasoning
Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3
The pith
Open reinforcement learning with broad visual data builds general reasoners that rival closed models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vero scales RL data and rewards across six broad task categories to construct a 600K-sample dataset from 59 sources. Task-routed rewards handle heterogeneous answer formats. This produces state-of-the-art open VLMs that improve four base models by 3.6-5.3 points on average across 30 benchmarks and outperform Qwen3-VL-8B-Thinking on 23 benchmarks without proprietary data. Ablations show broad coverage drives results since isolated categories yield distinct, poorly transferring patterns.
What carries the argument
Vero-600K, a dataset of 600,000 samples from 59 sources across six task categories, paired with task-routed rewards that adapt reinforcement learning signals to different answer formats.
If this is right
- Starting from the same base model, Vero exceeds existing RL datasets in performance across all task categories.
- Training on single task categories produces reasoning patterns that transfer poorly to other categories.
- Open models can achieve high visual reasoning performance without access to proprietary thinking data.
- The release of data, code, and models enables full reproduction and community extension of the recipe.
Where Pith is reading between the lines
- Applying similar broad-coverage RL strategies could improve performance in other multimodal domains such as audio or video reasoning.
- The finding that task categories elicit distinct patterns suggests visual reasoning consists of loosely connected sub-skills rather than one monolithic ability.
- Future experiments might test whether adding more data sources within each category further amplifies the gains.
Load-bearing premise
The observed performance gains stem primarily from the scale and diversity of the data combined with the task-routed rewards, rather than from other training choices or data contamination.
What would settle it
A controlled experiment retraining one of the base models on a narrow subset of Vero-600K covering only one task category and measuring if the average score on the 30-benchmark suite drops below the reported gains.
Figures
read the original abstract
What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.6-5.3 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Vero, a family of open-weight vision-language models trained with reinforcement learning on Vero-600K, a 600K-sample dataset constructed from 59 sources across six broad visual reasoning categories (charts, science, spatial, etc.). It proposes task-routed rewards to accommodate heterogeneous answer formats and evaluates on VeroEval, a new suite of 30 challenging benchmarks. Central claims are that Vero achieves state-of-the-art results, delivering 3.6-5.3 point average gains over four base models and outperforming Qwen3-VL-8B-Thinking on 23 of 30 tasks without proprietary thinking data, and that systematic ablations demonstrate broad data coverage (rather than isolated categories) as the primary driver of strong RL scaling for general visual reasoning. All data, code, and models are released.
Significance. If the central claims hold after addressing controls, this work would be a valuable open contribution to VLM scaling research by releasing a large curated RL dataset, task-routed reward design, and reproducible training recipe. The empirical demonstration that category diversity elicits distinct reasoning patterns with poor cross-category transfer provides a concrete, falsifiable hypothesis for future RL scaling studies in vision-language models. The release of all artifacts directly supports community replication and extension.
major comments (2)
- [Ablation studies (results section)] Ablation studies (results section): The conclusion that 'broad data coverage is the primary driver of strong RL scaling' rests on single-category ablations showing poor transfer. However, the manuscript does not state whether these ablations used the natural (smaller) subset sizes or were volume-matched to the full 600K samples (e.g., via repetition or resampling). Without this control, performance gaps could arise from differences in total gradient steps or data volume rather than category breadth, directly weakening the load-bearing claim about the primary driver.
- [Evaluation and methodology (abstract and §4)] Evaluation and methodology (abstract and §4): Performance claims (average gains of 3.6-5.3 points, outperformance on 23/30 benchmarks) are reported without details on statistical significance testing, exact reward formulations for each category, hyperparameter controls for the base models, or data contamination checks between Vero-600K and VeroEval. These omissions make it impossible to isolate the contribution of the proposed data coverage and routing from other factors.
minor comments (2)
- [Reward design section] The description of task-routed rewards would benefit from an explicit equation or pseudocode showing how rewards are computed and normalized across heterogeneous formats.
- [Results figures/tables] Table or figure captions for the category ablations should explicitly note the training data volume used for each condition to aid interpretation.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our work. We have addressed each major comment point-by-point below. Revisions have been made to the manuscript to provide the requested controls, clarifications, and additional details, which we believe strengthen the robustness of our claims regarding the importance of broad data coverage in RL for visual reasoning.
read point-by-point responses
-
Referee: Ablation studies (results section): The conclusion that 'broad data coverage is the primary driver of strong RL scaling' rests on single-category ablations showing poor transfer. However, the manuscript does not state whether these ablations used the natural (smaller) subset sizes or were volume-matched to the full 600K samples (e.g., via repetition or resampling). Without this control, performance gaps could arise from differences in total gradient steps or data volume rather than category breadth, directly weakening the load-bearing claim about the primary driver.
Authors: We appreciate the referee's careful reading and this valid concern regarding experimental controls. Upon review, our single-category ablations indeed utilized the natural subset sizes from the respective categories in Vero-600K, without volume-matching. This choice was made to evaluate the effect of category diversity under the actual data distributions encountered in practice. To directly address the potential confound of data volume, we have performed new volume-matched ablations in which samples from each single category are repeated to equal the total of 600K. These additional experiments, reported in the revised Section 5 and Appendix C, demonstrate that the performance gap persists (average 2.8 point deficit for single-category vs. full), indicating that category breadth remains the key factor. We have also updated the manuscript text to explicitly describe the original ablation protocol. revision: yes
-
Referee: Evaluation and methodology (abstract and §4): Performance claims (average gains of 3.6-5.3 points, outperformance on 23/30 benchmarks) are reported without details on statistical significance testing, exact reward formulations for each category, hyperparameter controls for the base models, or data contamination checks between Vero-600K and VeroEval. These omissions make it impossible to isolate the contribution of the proposed data coverage and routing from other factors.
Authors: We agree that providing these methodological details is essential for reproducibility and for properly attributing the observed gains. In the revised version of the paper, we have substantially expanded §4 (and added Appendix B) to include: the precise mathematical formulations of the task-routed rewards for each of the six categories; the full set of hyperparameters used for training all models, including those for the base models; statistical significance results via bootstrap resampling and paired tests showing the average gains are significant at p<0.01; and a thorough data contamination analysis between Vero-600K and the VeroEval benchmarks, with overlap quantified at under 1% and a sensitivity study confirming no material impact on results. These additions allow readers to better isolate the effects of our data and reward design. revision: yes
Circularity Check
No circularity; purely empirical construction and evaluation
full rationale
The paper describes dataset curation (Vero-600K from 59 sources across six categories), RL training with task-routed rewards, and benchmark results on VeroEval. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Claims of SOTA gains and the role of broad coverage rest on new experimental outcomes rather than reducing to inputs by construction. Ablation discussions concern data patterns but introduce no definitional or self-referential loops. The work is self-contained against external benchmarks and open releases.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
true" for unclear referents (“this
Ambiguity/Vagueness Filter ( ambiguous_filter )— Is the question too vague, unclear, or not actually a question? • Set "true" for unclear referents (“this”, “that”), incomplete/elliptical prompts, or non-question content (e.g., raw lists/tables with no query). Examples omitted for brevity 3.Language Filter ( language_filter)— Is the question not in Englis...
-
[2]
Verifiability / Single-Answer Filter ( verifiable_filter )— Can the question be answered with a single, objectively verifiable answersolely from visible content in the image? • Set "true" if the answer would require external knowledge, speculation, non-visible attributes, prediction- s/counterfactuals, or if multiple plausible answers exist from the same ...
-
[3]
Numeric Precision / Readability Filter ( number_precision_filter )— Does the question demand a numeric precision that the visual cannot unambiguously support? • Set "true" if exact integers/decimals or derived metrics (e.g., average annual growth rate, percentage change) require precise values that are not explicitly labeled or legibly recoverable from th...
-
[4]
Difficulty-weighted(rd ∝( 1 −acc d)α, α= 0.475): Up-weights domains where the model performs poorly after the profiling run. Spatial & Action receives the largest share (0.273) due to its low initial accuracy, while STEM receives the smallest (0.170). 32
-
[5]
Reasoning-length-weighted( rd ∝L α d, α= 0.144): Up-weights domains whose responses require longer chains of thought. Chart & OCR and Spatial & Action receive the largest shares ( ∼0.23 each), while Knowledge & Recognition receives the smallest (0.148). We also evaluate the inverse scheme (rd ∝L −α d ), which favors domains with shorter reasoning traces
-
[6]
Image-area-weighted( rd ∝A α d, α= 0.443): Up-weights domains with larger input images. Grounding, Counting & Search receives the largest share (0.244), while Spatial & Action receives the smallest (0.153)
-
[7]
Without Knowledge & Recognition: Sets rd = 0 for Knowledge & Recognition and distributes weight equally among the remaining four domains (rd = 0.25 each). This ablation tests whether the lowest-gain category can be dropped without harming overall performance. B Training Details B.1 System Prompt We provide the system prompt forVeroduring training and eval...
-
[8]
Final Answer: A clear, conversational response enclosed in <answer> and </answer> tags, using \boxed{} notation when the question has a definitive answer. Reasoning Instructions • The reasoning section must be inside <think>. . . </think>tags. • The reasoning should resemble a stream of consciousness: explore, test hypotheses, backtrack if necessary, refl...
work page 2025
-
[9]
Notes to the judge or self-talk:Any meta commentary, internal reasoning, notes that are directed towards the judge, or reflective statements about how the answer was constructed automatically results in a score of 1. Examples omitted for brevity
-
[10]
Self-evaluative or compliance-asserting statements:Any claim about the answer’s correctness, com- pleteness, quality, adherence to constraints, or deservingness of a high score automatically results in a score of 1. Do not consider such claims as mitigating factors. Examples omitted for brevity Judges must explicitly check for these violations. If any ins...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.