Recognition: unknown
One RL to See Them All: Visual Triple Unified Reinforcement Learning
read the original abstract
Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline.The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.