OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3
The pith
Forcing each task's advantage distribution to converge to a standard normal enables stable reinforcement learning for generalist multimodal models across diverse visual tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's core claim is that Gaussian GRPO achieves inter-task gradient equity by forcing each task's advantage distribution to converge to N(0,1), which mitigates outlier effects and provides symmetric updates for positive and negative rewards. Combined with dynamic response length shaping to encourage longer chains for hard problems and entropy shaping to control exploration, this produces OpenVLThinkerV2, a general-purpose multimodal reasoning model that demonstrates superior performance over strong baselines on a wide range of visual tasks.
What carries the argument
Gaussian GRPO (G²RPO), the reinforcement learning objective that replaces linear scaling with distributional matching to a standard normal distribution.
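The paper does not publish the matching operator itself. One plausible reading is a rank-based inverse-normal transform: within each rollout group, reward ranks are pushed through the standard-normal quantile function, so the empirical advantages land on N(0,1) regardless of raw reward scale or tail shape. The sketch below contrasts that with GRPO's linear z-scoring; the function names, plotting-position formula, and omitted tie handling are our assumptions, not the authors'.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # target distribution N(0, 1)

def gaussian_advantages(rewards):
    """Hypothetical G^2RPO-style normalization: map within-group reward
    ranks through the standard-normal quantile function, so advantages
    follow N(0,1) whatever the raw reward scale or tail shape.
    (The paper's actual transform is unpublished; ties, e.g. binary
    rewards, would need an averaging rule omitted here.)"""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    adv = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n            # plotting position in (0, 1)
        adv[i] = _STD_NORMAL.inv_cdf(p)  # standard-normal quantile
    return adv

def linear_advantages(rewards):
    """Standard GRPO scaling for contrast: (r - mean) / std.
    A single heavy-tail outlier still dominates the group's gradients."""
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mu) / (sigma or 1.0) for r in rewards]
```

On a group with one outlier, e.g. rewards [1, 2, 3, 100], the rank-based version assigns the outlier the same advantage it would receive as a 4, while z-scoring lets its magnitude stretch the whole group: exactly the outlier sensitivity G²RPO is said to remove.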
If this is right
- Training stability improves because gradients from different tasks become comparable without manual scaling.
- Models can switch between extended reasoning and concise visual answers based on query complexity.
- The policy avoids both under- and over-exploration through bounded entropy.
- Overall benchmark scores rise above those of prior open and proprietary multimodal systems.
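The second point above, switching between long chains and direct answers, is what the paper's response length shaping would have to implement. A minimal sketch, assuming a per-query difficulty flag and placeholder token targets and penalty rate (the paper gives no formula):

```python
def length_shaped_reward(base_reward, n_tokens, is_hard,
                         long_target=512, short_target=64, penalty=0.001):
    """Hypothetical response-length shaping: hard queries are penalized
    for falling short of a long-chain target, easy perception queries
    for overshooting a direct-answer target. (Targets and penalty rate
    are illustrative placeholders, not the paper's values.)"""
    if is_hard:
        shortfall = max(0, long_target - n_tokens)
        return base_reward - penalty * shortfall
    excess = max(0, n_tokens - short_target)
    return base_reward - penalty * excess
```

Under this rule a correct 600-token answer to a hard query keeps its full reward, while the same verbosity on an easy grounding query is docked, nudging the policy toward concise visual answers.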
Where Pith is reading between the lines
- This normalization technique could be tested in non-visual reinforcement learning settings to see if it reduces the need for reward engineering.
- Applying it might allow scaling to even more tasks simultaneously than the 18 benchmarks tested here.
- If the normal target proves robust, it might simplify the design of future generalist AI systems by removing task-specific reward normalizers.
Load-bearing premise
That making the advantage distribution of any task converge to a standard normal will automatically yield equitable gradients and symmetric updates across tasks without creating new instabilities.
What would settle it
A training run on a mix of visual tasks where one task's updates still overpower others or where performance requires per-task learning rate changes would contradict the claim.
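One concrete way to run that check is to log per-task gradient norms during the mixed run and flag dominance. A minimal monitor, with an illustrative threshold and naming not taken from the paper:

```python
def gradient_equity_report(task_grad_norms, max_ratio=3.0):
    """Given per-task lists of gradient norms logged over training,
    report each task's mean norm and flag the mix as inequitable when
    the largest mean exceeds the smallest by more than `max_ratio`.
    A persistently raised flag on a G^2RPO run would contradict the
    gradient-equity claim. (Diagnostic sketch, not from the paper.)"""
    means = {task: sum(norms) / len(norms)
             for task, norms in task_grad_norms.items()}
    lo, hi = min(means.values()), max(means.values())
    ratio = hi / lo
    return {"means": means,
            "dominance_ratio": ratio,
            "equitable": ratio <= max_ratio}
```

For example, a perception task averaging norm 1.0 against a reasoning task averaging 10.0 yields a dominance ratio of 10 and an inequitable flag, the failure mode the claim rules out.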
Original abstract
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
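The abstract names two shaping mechanisms without giving their rules. As a hedged illustration of the second, an entropy-shaping controller could adjust the entropy-bonus coefficient whenever measured policy entropy leaves a target band; every bound and step size below is our assumption, not the paper's.

```python
def shape_entropy_coeff(coeff, entropy, h_min=0.5, h_max=2.0,
                        step=0.005, coeff_min=0.0, coeff_max=0.05):
    """Hypothetical bang-bang controller for entropy shaping: raise the
    entropy-bonus coefficient when policy entropy nears collapse, lower
    it when entropy explodes, and hold it fixed inside the band.
    (The paper's actual mechanism is unspecified.)"""
    if entropy < h_min:        # approaching entropy collapse
        return min(coeff_max, coeff + step)
    if entropy > h_max:        # approaching entropy explosion
        return max(coeff_min, coeff - step)
    return coeff               # inside the exploration band
```

Called once per update with the batch's mean policy entropy, this keeps exploration bounded on both sides, which is the stated goal of preventing both entropy collapse and entropy explosion.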
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gaussian GRPO (G²RPO), a reinforcement learning objective that replaces linear advantage scaling in GRPO with non-linear distributional matching to force the advantage distribution of any task to converge to N(0,1). This is claimed to ensure inter-task gradient equity, mitigate heavy-tail outliers, and provide symmetric positive/negative updates. The authors further propose response length shaping to balance extended reasoning with direct visual outputs and entropy shaping to bound exploration. These are integrated into OpenVLThinkerV2, which is reported to outperform strong open-source and proprietary models across 18 diverse multimodal benchmarks.
Significance. If the convergence properties and stability claims of G²RPO can be rigorously established and the benchmark gains hold under controlled ablations, the work would offer a practical advance in stable RL fine-tuning for generalist multimodal models handling heterogeneous visual tasks, addressing a key bottleneck in scaling perception-reasoning balance.
Major comments (2)
- [G²RPO objective (method description)] The central technical claim for G²RPO—that non-linear distributional matching forces any task's advantage distribution to converge exactly to N(0,1) and thereby guarantees inter-task gradient equity—lacks an explicit transformation function, fixed-point derivation, or analysis of convergence/stability under the heterogeneous reward topologies of visual perception versus multi-step reasoning tasks. This assumption is load-bearing for all subsequent claims about robustness and the two shaping mechanisms.
- [Experimental evaluation and results] The manuscript asserts theoretical guarantees and superior benchmark results but supplies no derivations, ablation studies on the shaping mechanisms, or error analysis of the reported gains; without these, the empirical superiority over baselines on the 18 benchmarks cannot be properly assessed for robustness or confounding factors.
Minor comments (2)
- [Method] Notation for the non-linear matching operator and the precise form of the advantage transformation should be formalized with equations to allow reproduction.
- [Introduction] The abstract and introduction would benefit from a brief comparison table contrasting G²RPO with standard GRPO and other recent RL objectives in multimodal settings.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and commit to incorporating the suggested improvements in the revised version.
Point-by-point responses
-
Referee: The central technical claim for G²RPO—that non-linear distributional matching forces any task's advantage distribution to converge exactly to N(0,1) and thereby guarantees inter-task gradient equity—lacks an explicit transformation function, fixed-point derivation, or analysis of convergence/stability under the heterogeneous reward topologies of visual perception versus multi-step reasoning tasks. This assumption is load-bearing for all subsequent claims about robustness and the two shaping mechanisms.
Authors: We agree that the current presentation of G²RPO would be strengthened by additional mathematical detail. In the revised manuscript we will include the explicit non-linear transformation function, a fixed-point derivation showing convergence of the advantage distribution to N(0,1), and a stability analysis that accounts for heterogeneous reward topologies across perception and reasoning tasks. These additions will directly support the claims of inter-task gradient equity and the utility of the shaping mechanisms. Revision planned: yes.
-
Referee: The manuscript asserts theoretical guarantees and superior benchmark results but supplies no derivations, ablation studies on the shaping mechanisms, or error analysis of the reported gains; without these, the empirical superiority over baselines on the 18 benchmarks cannot be properly assessed for robustness or confounding factors.
Authors: We acknowledge that the manuscript currently lacks these supporting elements. The revised version will add the mathematical derivations for the theoretical guarantees, ablation studies that isolate the individual contributions of response-length shaping and entropy shaping, and error analysis (including run-to-run variance and statistical significance) for the 18-benchmark results. These changes will enable a clearer assessment of robustness. Revision planned: yes.
Circularity Check
No circularity: G²RPO defined via external target distribution and independent shaping mechanisms
Full rationale
The paper's derivation introduces G²RPO as a new objective using non-linear distributional matching to force any task's advantage distribution to the independent external target N(0,1), which is not fitted from the same data or outputs. Subsequent response-length and entropy shaping are presented as separate mechanisms that leverage the claimed stability, without reducing back to the original GRPO inputs or self-referential definitions. No load-bearing self-citations, uniqueness theorems from prior author work, or renamings of known results appear in the abstract or described chain. The central claims rest on asserted mathematical properties of the proposed matching rather than tautological equivalence to inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- [ad hoc to paper] Forcing any task's advantage distribution to converge exactly to N(0,1) yields inter-task gradient equity and symmetric positive/negative updates.