One RL to See Them All: Visual Triple Unified Reinforcement Learning

Junjie Yan; Linge Du; Lizhuang Ma; Pengfei Li; Pengfei Liu; Qibing Ren; Shaoxiang Chen; Xuyang Shen; Yan Ma; Yuchao Dai

arxiv: 2505.18129 · v3 · submitted 2025-05-23 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-19 12:47 UTC · model grok-4.3

keywords: reinforcement learning · vision-language models · unified multimodal RL · V-Triune · Orsta · reasoning tasks · perception tasks · reward routing

The pith

Unified reinforcement learning trains one vision-language model on both reasoning and perception tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes V-Triune as a methodology to apply reinforcement learning across heterogeneous tasks in vision-language models. It coordinates three abstractions to route rewards, verify outcomes, and diagnose sources while using dynamic IoU to shape localization rewards without ambiguity or sparsity. The resulting Orsta models, trained jointly on eight tasks, match or exceed the performance of specialist mixtures under equal compute budgets and show gains on MEGA-Bench plus downstream evaluations. If the approach holds, it would allow a single training pipeline to advance both reasoning and perception capabilities instead of maintaining separate specialist systems.

Core claim

Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline.

What carries the argument

V-Triune, which organizes training around Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics, and applies Dynamic IoU for localization-specific reward shaping.
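One way to picture Dynamic IoU is as a localization reward whose acceptance threshold tightens as training progresses: loose early, so rewards are not sparse, and strict later, so loosely fitting boxes stop earning credit. The sketch below is illustrative only. The schedule values, the function names, and the rule "reward equals IoU when it clears the threshold, otherwise zero" are assumptions made for this example; the abstract does not specify the parameterization.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned

# Hypothetical schedule: (upper bound on training progress, IoU threshold).
# Placeholder values for illustration, not the paper's settings.
SCHEDULE: Sequence[Tuple[float, float]] = (
    (0.10, 0.50),
    (0.30, 0.75),
    (1.00, 0.95),
)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_at(progress: float) -> float:
    """Return the IoU threshold for a training progress value in [0, 1]."""
    for upper, threshold in SCHEDULE:
        if progress <= upper:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, gt: Box, progress: float) -> float:
    """Assumed rule: pay out the IoU itself once it clears the current threshold."""
    score = iou(pred, gt)
    return score if score >= threshold_at(progress) else 0.0
```

Under this assumed schedule, a prediction with IoU 0.8 is rewarded early in training and receives zero once the threshold reaches 0.95, which is the loose-to-strict behavior the abstract describes.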

If this is right

Unified training matches or exceeds the results of training separate specialist models for each task type.
Gains from the joint process transfer to a wide range of downstream benchmarks.
Both reasoning and perception capabilities advance together inside one model and one pipeline.
Specialist mixtures are not required to reach strong performance levels under matched budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the coordination scales, it could reduce the engineering overhead of maintaining multiple reward systems for multimodal models.
The same abstractions might extend to other mixed-capability settings, such as combining language reasoning with tool use.
Larger-scale tests could reveal whether task conflicts appear only beyond the eight-task regime examined here.

Load-bearing premise

The three abstractions can be coordinated effectively for heterogeneous reasoning and perception-heavy tasks without introducing performance conflicts or reward issues that undermine the unified training gains.

What would settle it

If separate specialist RL trainings on the same eight tasks, using equivalent total compute, achieve higher average scores than the unified Orsta models on MEGA-Bench and the downstream benchmarks, the unified approach would be challenged.

Original abstract

Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

V-Triune gives a workable set of abstractions for unified RL across mixed VLM tasks and the Orsta models look competitive, but the matched-budget comparison needs tighter accounting to hold up.

The main point is that this paper lays out V-Triune as a way to run RL on heterogeneous reasoning and perception tasks inside one VLM instead of training separate specialists. The three abstractions—Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics—plus the Dynamic IoU reward for localization are the concrete pieces they add to make that work without obvious reward collapse or task interference. They train Orsta at 7B and 32B on eight tasks jointly, release the code and models, and report that the unified run matches or beats specialist mixtures while lifting MEGA-Bench scores and carrying over to other benchmarks.
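The interplay of the three abstractions can be sketched roughly as follows: each sample names the verifier that should score it, a registry dispatches to that verifier, and rewards are then averaged per data source for monitoring. This is a minimal sketch under stated assumptions, not the released V-Triune code. The `Sample` fields, the two toy verifiers, and the registry keys are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    source: str    # dataset the sample was drawn from (hypothetical field)
    verifier: str  # key of the verifier that should score this sample
    answer: str    # reference answer
    response: str  # model output to be scored


def exact_match(sample: Sample) -> float:
    """Toy verifier: 1.0 if the stripped response equals the reference."""
    return 1.0 if sample.response.strip() == sample.answer.strip() else 0.0


def contains_answer(sample: Sample) -> float:
    """Toy verifier: 1.0 if the reference appears anywhere in the response."""
    return 1.0 if sample.answer in sample.response else 0.0


# Verifier-level: each task family gets its own scoring function.
VERIFIERS: Dict[str, Callable[[Sample], float]] = {
    "exact": exact_match,
    "contains": contains_answer,
}


def route_rewards(batch: List[Sample]) -> List[float]:
    """Sample-level routing: every sample is scored by the verifier it names."""
    return [VERIFIERS[s.verifier](s) for s in batch]


def diagnostics_by_source(batch: List[Sample], rewards: List[float]) -> Dict[str, float]:
    """Source-level diagnostics: mean reward per originating dataset."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for sample, reward in zip(batch, rewards):
        buckets[sample.source].append(reward)
    return {src: sum(vals) / len(vals) for src, vals in buckets.items()}
```

Keeping the routing key on the sample, rather than on the batch, is what lets a single mixed batch carry both reasoning and perception items; the per-source averages then make it possible to spot one dataset dragging rewards down.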

Referee Report

1 major / 2 minor

Summary. The manuscript introduces V-Triune, a Visual Triple Unified Reinforcement Learning methodology for post-training vision-language models on heterogeneous reasoning and perception tasks. It organizes training around three coordinated abstractions—Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics—along with Dynamic IoU for localization-specific reward shaping. The authors develop Orsta (7B and 32B) models jointly trained on eight tasks, claiming that under matched budgets unified training matches or outperforms specialist mixtures. The final models improve over backbones on MEGA-Bench, compare favorably to multi-task RL-VLM baselines, and transfer gains to downstream benchmarks. The V-Triune system and Orsta models are publicly released.

Significance. If the empirical claims are verified with explicit compute accounting, this would demonstrate that unified RL can handle diverse multimodal tasks in a single pipeline without introducing conflicts, simplifying VLM post-training. The public release of code and models is a clear strength supporting reproducibility.

major comments (1)

[Experimental results] The central claim that 'under matched budgets, unified training matches or outperforms specialist mixtures' (abstract and experimental results) is load-bearing for attributing gains to V-Triune. The manuscript asserts budget equivalence but provides no tabulated breakdown of aggregate gradient steps, total tokens processed, or effective sampling rates across the eight tasks for the unified Orsta model versus per-task specialists. Without this accounting, the comparison cannot isolate the benefit of the three abstractions and Dynamic IoU from potential differences in optimization signal.

minor comments (2)

[Methodology] The description of how Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics are coordinated during joint training on heterogeneous tasks would benefit from pseudocode or a workflow diagram to clarify potential interactions.
[Results] Specify the exact multi-task RL-VLM baselines and statistical details (error bars, number of runs) when reporting comparisons on MEGA-Bench and downstream benchmarks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on experimental budget accounting is well-taken and will help strengthen the clarity of our claims. We address the major comment point-by-point below.

Referee: [Experimental results] The central claim that 'under matched budgets, unified training matches or outperforms specialist mixtures' (abstract and experimental results) is load-bearing for attributing gains to V-Triune. The manuscript asserts budget equivalence but provides no tabulated breakdown of aggregate gradient steps, total tokens processed, or effective sampling rates across the eight tasks for the unified Orsta model versus per-task specialists. Without this accounting, the comparison cannot isolate the benefit of the three abstractions and Dynamic IoU from potential differences in optimization signal.

Authors: We agree that an explicit tabulated breakdown is necessary to rigorously support the matched-budget claim and to isolate the contributions of V-Triune. While the original experiments were designed with equivalent total training compute (same aggregate gradient steps and token volume allocated across the unified pipeline and the specialist mixtures), this equivalence was described narratively rather than presented in consolidated tabular form. In the revised manuscript we will add a new table (Table 3) that reports, for each setup: (i) total gradient steps, (ii) total tokens processed, and (iii) effective per-task sampling rates. This addition will make the budget equivalence transparent and allow readers to verify that performance differences are attributable to the three abstractions and Dynamic IoU rather than unequal optimization signal.

Revision: yes
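The accounting requested here could be checked mechanically along the following lines. This is a hedged sketch: `RunBudget`, `combine`, `budgets_match`, and the 1% tolerance are hypothetical names and choices for illustration, and no figures from the paper are used.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunBudget:
    name: str
    gradient_steps: int
    tokens_per_task: Dict[str, int]  # tokens processed, keyed by task

    @property
    def total_tokens(self) -> int:
        return sum(self.tokens_per_task.values())

    def sampling_rates(self) -> Dict[str, float]:
        """Effective share of processed tokens that each task received."""
        total = self.total_tokens
        return {task: n / total for task, n in self.tokens_per_task.items()}


def combine(name: str, runs: List[RunBudget]) -> RunBudget:
    """Aggregate several specialist runs into one mixture-level budget."""
    tokens: Dict[str, int] = {}
    for run in runs:
        for task, n in run.tokens_per_task.items():
            tokens[task] = tokens.get(task, 0) + n
    return RunBudget(name, sum(r.gradient_steps for r in runs), tokens)


def budgets_match(a: RunBudget, b: RunBudget, rel_tol: float = 0.01) -> bool:
    """True when gradient steps and total tokens agree within rel_tol."""
    return (
        math.isclose(a.gradient_steps, b.gradient_steps, rel_tol=rel_tol)
        and math.isclose(a.total_tokens, b.total_tokens, rel_tol=rel_tol)
    )
```

Comparing a unified run against `combine(...)` over the per-task specialists yields the three quantities the rebuttal promises to tabulate: total gradient steps, total tokens, and per-task sampling rates.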

Circularity Check

0 steps flagged

No circularity: claims rest on empirical training outcomes

The paper introduces V-Triune via three named abstractions (Sample-Level Reward Routing, Verifier-Level Outcome Verification, Source-Level Diagnostics) plus Dynamic IoU reward shaping, then reports results from jointly training Orsta (7B/32B) models on eight tasks. All headline claims—unified training matching or outperforming specialist mixtures under matched budgets, gains on MEGA-Bench, and downstream transfer—are presented as direct experimental measurements rather than quantities derived from equations that loop back to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation; the methodology is justified by its stated coordination of the three abstractions and the observed training results. The chain is therefore self-contained through external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text. Dynamic IoU is introduced as a new reward shaping mechanism but its parameterization is not specified here.

pith-pipeline@v0.9.0 · 5785 in / 1368 out tokens · 68311 ms · 2026-05-19T12:47:24.033582+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "V-Triune organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics... Dynamic IoU provides localization-specific reward shaping"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "Under matched budgets, unified training matches or outperforms specialist mixtures"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Vision-Language Models Get Lost in Attention
cs.AI · 2026-05 · unverdicted · novelty 6.0
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Perception-Aware Policy Optimization for Multimodal Reasoning
cs.CL · 2025-07 · unverdicted · novelty 6.0
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
cs.AI · 2025-09 · unverdicted · novelty 5.0
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...