One RL to See Them All: Visual Triple Unified Reinforcement Learning
Pith reviewed 2026-05-19 12:47 UTC · model grok-4.3
The pith
Unified reinforcement learning trains one vision-language model on both reasoning and perception tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline.
What carries the argument
V-Triune organizes training around Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics, and applies Dynamic IoU for localization-specific reward shaping.
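One way to picture how the three abstractions could fit together is sketched below. This is a minimal illustration under stated assumptions, not the V-Triune implementation: the `Sample` fields, the `VERIFIERS` registry, and both toy verifier functions are hypothetical names introduced here. Each sample names its own verifier (routing), the verifier scores the outcome (verification), and rewards are averaged per data source (diagnostics).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    source: str    # dataset the sample came from (hypothetical field)
    verifier: str  # name of the verifier that should score it (hypothetical field)
    answer: str    # ground-truth answer


def exact_match(prediction: str, sample: Sample) -> float:
    """Toy reasoning verifier: 1.0 if the normalized strings are equal."""
    return float(prediction.strip().lower() == sample.answer.strip().lower())


def contains_answer(prediction: str, sample: Sample) -> float:
    """Toy perception verifier: 1.0 if the answer appears in the prediction."""
    return float(sample.answer.lower() in prediction.lower())


# Hypothetical registry mapping verifier names to scoring functions.
VERIFIERS: Dict[str, Callable[[str, Sample], float]] = {
    "math": exact_match,
    "ocr": contains_answer,
}


def score_batch(
    samples: List[Sample], predictions: List[str]
) -> Tuple[List[float], Dict[str, float]]:
    """Route each sample to its verifier and collect per-source mean rewards."""
    rewards: List[float] = []
    per_source: Dict[str, List[float]] = defaultdict(list)
    for sample, prediction in zip(samples, predictions):
        reward = VERIFIERS[sample.verifier](prediction, sample)  # sample-level routing
        rewards.append(reward)
        per_source[sample.source].append(reward)  # source-level bookkeeping
    diagnostics = {src: sum(vals) / len(vals) for src, vals in per_source.items()}
    return rewards, diagnostics
```

Keeping the verifier name on the sample, rather than on the dataset or the batch, is what would let a single mixed batch contain heterogeneous tasks in this sketch.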
If this is right
- Unified training matches or exceeds the results of training separate specialist models for each task type.
- Gains from the joint process transfer to a wide range of downstream benchmarks.
- Both reasoning and perception capabilities advance together inside one model and one pipeline.
- Specialist mixtures are not required to reach strong performance levels under matched budgets.
Where Pith is reading between the lines
- If the coordination scales, it could reduce the engineering overhead of maintaining multiple reward systems for multimodal models.
- The same abstractions might extend to other mixed-capability settings, such as combining language reasoning with tool use.
- Larger-scale tests could reveal whether task conflicts appear only beyond the eight-task regime examined here.
Load-bearing premise
The three abstractions can be coordinated effectively for heterogeneous reasoning and perception-heavy tasks without introducing performance conflicts or reward issues that undermine the unified training gains.
What would settle it
If separate specialist RL trainings on the same eight tasks, using equivalent total compute, achieve higher average scores than the unified Orsta models on MEGA-Bench and the downstream benchmarks, the unified approach would be challenged.
Original abstract
Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.
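The trade-off the abstract describes for Dynamic IoU (loose thresholds reward imprecise boxes, strict thresholds rarely reward anything early in training) can be illustrated with a threshold that tightens as training progresses. The sketch below is an assumption-laden illustration, not the paper's formulation: the `SCHEDULE` values, the function names, and the choice of a binary reward are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These values are placeholders chosen for illustration only.
SCHEDULE: List[Tuple[float, float]] = [(0.0, 0.5), (0.3, 0.75), (0.6, 0.9)]


def dynamic_threshold(progress: float) -> float:
    """Return the threshold of the latest schedule stage that has started."""
    threshold = SCHEDULE[0][1]
    for start, value in SCHEDULE:
        if progress >= start:
            threshold = value
    return threshold


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Binary reward (an assumption): 1.0 if IoU clears the current threshold."""
    return 1.0 if iou(pred, target) >= dynamic_threshold(progress) else 0.0
```

With this schedule, a prediction with IoU 0.8 earns a reward early in training, when the signal would otherwise be sparse, and stops earning one once the threshold reaches 0.9.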
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces V-Triune, a Visual Triple Unified Reinforcement Learning methodology for post-training vision-language models on heterogeneous reasoning and perception tasks. It organizes training around three coordinated abstractions—Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics—along with Dynamic IoU for localization-specific reward shaping. The authors develop Orsta (7B and 32B) models jointly trained on eight tasks, claiming that under matched budgets unified training matches or outperforms specialist mixtures. The final models improve over backbones on MEGA-Bench, compare favorably to multi-task RL-VLM baselines, and transfer gains to downstream benchmarks. The V-Triune system and Orsta models are publicly released.
Significance. If the empirical claims are verified with explicit compute accounting, this would demonstrate that unified RL can handle diverse multimodal tasks in a single pipeline without introducing conflicts, simplifying VLM post-training. The public release of code and models is a clear strength supporting reproducibility.
major comments (1)
- [Experimental results] The central claim that 'under matched budgets, unified training matches or outperforms specialist mixtures' (abstract and experimental results) is load-bearing for attributing gains to V-Triune. The manuscript asserts budget equivalence but provides no tabulated breakdown of aggregate gradient steps, total tokens processed, or effective sampling rates across the eight tasks for the unified Orsta model versus per-task specialists. Without this accounting, the comparison cannot isolate the benefit of the three abstractions and Dynamic IoU from potential differences in optimization signal.
minor comments (2)
- [Methodology] The description of how Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics are coordinated during joint training on heterogeneous tasks would benefit from pseudocode or a workflow diagram to clarify potential interactions.
- [Results] Specify the exact multi-task RL-VLM baselines and statistical details (error bars, number of runs) when reporting comparisons on MEGA-Bench and downstream benchmarks.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The feedback on experimental budget accounting is well-taken and will help strengthen the clarity of our claims. We address the major comment point-by-point below.
Referee: [Experimental results] The central claim that 'under matched budgets, unified training matches or outperforms specialist mixtures' (abstract and experimental results) is load-bearing for attributing gains to V-Triune. The manuscript asserts budget equivalence but provides no tabulated breakdown of aggregate gradient steps, total tokens processed, or effective sampling rates across the eight tasks for the unified Orsta model versus per-task specialists. Without this accounting, the comparison cannot isolate the benefit of the three abstractions and Dynamic IoU from potential differences in optimization signal.
Authors: We agree that an explicit tabulated breakdown is necessary to rigorously support the matched-budget claim and to isolate the contributions of V-Triune. While the original experiments were designed with equivalent total training compute (same aggregate gradient steps and token volume allocated across the unified pipeline and the specialist mixtures), this equivalence was described narratively rather than presented in consolidated tabular form. In the revised manuscript we will add a new table (Table 3) that reports, for each setup: (i) total gradient steps, (ii) total tokens processed, and (iii) effective per-task sampling rates. This addition will make the budget equivalence transparent and allow readers to verify that performance differences are attributable to the three abstractions and Dynamic IoU rather than unequal optimization signal.
Revision: yes
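The three quantities named in this exchange (gradient steps, tokens processed, per-task sampling rates) could be derived from training logs roughly as follows. This is a hedged sketch: `RunLog`, `budget_row`, and `budgets_match` are hypothetical names, the 1% tolerance is arbitrary, and defining a task's sampling rate as its share of processed tokens is an assumption rather than the paper's definition.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunLog:
    gradient_steps: int
    tokens_per_task: Dict[str, int]  # tokens processed for each task


def budget_row(log: RunLog) -> dict:
    """Summarize one training setup as a row of a budget table."""
    total_tokens = sum(log.tokens_per_task.values())
    # Assumption: sampling rate = a task's share of all processed tokens.
    rates = {task: n / total_tokens for task, n in log.tokens_per_task.items()}
    return {
        "gradient_steps": log.gradient_steps,
        "total_tokens": total_tokens,
        "sampling_rates": rates,
    }


def within(a: float, b: float, tolerance: float) -> bool:
    """True if a and b differ by at most `tolerance` relative to the larger."""
    return abs(a - b) <= tolerance * max(a, b)


def budgets_match(a: RunLog, b: RunLog, tolerance: float = 0.01) -> bool:
    """Check that two setups used the same steps and tokens within tolerance."""
    row_a, row_b = budget_row(a), budget_row(b)
    return within(row_a["gradient_steps"], row_b["gradient_steps"], tolerance) and within(
        row_a["total_tokens"], row_b["total_tokens"], tolerance
    )
```

Note that two setups can match on total steps and tokens while still differing in per-task sampling rates, which is why the referee asks for all three to be reported.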
Circularity Check
No circularity: claims rest on empirical training outcomes
The paper introduces V-Triune via three named abstractions (Sample-Level Reward Routing, Verifier-Level Outcome Verification, Source-Level Diagnostics) plus Dynamic IoU reward shaping, then reports results from jointly training Orsta (7B/32B) models on eight tasks. All headline claims—unified training matching or outperforming specialist mixtures under matched budgets, gains on MEGA-Bench, and downstream transfer—are presented as direct experimental measurements rather than quantities derived from equations that loop back to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation; the methodology is justified by its stated coordination of the three abstractions and the observed training results. The chain is therefore self-contained through external empirical validation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: V-Triune organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics... Dynamic IoU provides localization-specific reward shaping
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Under matched budgets, unified training matches or outperforms specialist mixtures
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Large Vision-Language Models Get Lost in Attention
  In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Perception-Aware Policy Optimization for Multimodal Reasoning
  PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.
- Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
  MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...