pith. machine review for the scientific record.

arxiv: 2605.09262 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.CL

Recognition: no theorem link

Reinforcing Multimodal Reasoning Against Visual Degradation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal large language models · reinforcement learning · visual robustness · fine-tuning · reasoning · image degradation · policy optimization · reward poisoning

The pith

A new RL fine-tuning framework called ROMA makes multimodal reasoning robust to visual degradations like blur and compression while matching clean-image accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that critic-free reinforcement learning on multimodal large language models leaves reasoning policies brittle to real-world image corruptions, and that the obvious fix of injecting degraded inputs during rollouts backfires: perceptual occlusions trigger hallucinated trajectories and poison the reward. ROMA addresses this by altering the optimization process with a dual-forward-pass strategy that evaluates corrupted views against clean trajectories via teacher forcing, plus token-level regularization and auxiliary losses that maintain a stable reward signal. A sympathetic reader would care because practical deployments often involve imperfect visuals from cameras or scans, where current methods can fail even if they perform well in ideal conditions. If the approach holds, it offers a way to train autoregressive MLLMs that reason reliably under visual noise without retraining from scratch or sacrificing baseline performance.

Core claim

ROMA modifies the optimization dynamics of RL fine-tuning for autoregressive MLLMs by using a dual-forward-pass strategy with teacher forcing to evaluate corrupted views against clean-image trajectories, applying a token-level surrogate KL penalty against the worst-case augmentation for distributional consistency, adding an auxiliary policy gradient loss anchored to clean-image advantages to preserve a reliable reward signal, and enforcing correctness-conditioned regularization that restricts invariance to successful trajectories. This combination avoids reward poisoning from perceptual occlusions and prevents policy collapse under regularization, yielding improved robustness on Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks while matching clean-image accuracy.
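
For concreteness, here is a minimal sketch of how these four pieces could combine into a single training loss in a GRPO-style, critic-free setup. The tensor shapes, the KL direction, the stop-gradient on the clean distribution, the weights beta and lambda_aux, and the decision to gate the auxiliary term with the same correctness mask are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def roma_loss(clean_logits, corrupted_logits_list, trajectory_tokens,
              advantages, reward_correct, beta=0.1, lambda_aux=0.5):
    """Illustrative combination of the components named in the core claim.

    clean_logits:          [T, V] logits for the sampled trajectory under the clean image
    corrupted_logits_list: list of [T, V] logits, one per degraded view, obtained by
                           teacher-forcing the same trajectory (no rollouts on degraded inputs)
    trajectory_tokens:     [T] tokens of the rollout sampled on the clean input
    advantages:            [T] clean-image advantages (e.g. group-normalized rewards)
    reward_correct:        whether the clean rollout reached the correct answer
    """
    logp_clean = F.log_softmax(clean_logits, dim=-1)                                  # [T, V]
    token_logp_clean = logp_clean.gather(-1, trajectory_tokens.unsqueeze(-1)).squeeze(-1)

    # Main objective: policy gradient on the clean rollout only, so degraded inputs
    # never generate trajectories and cannot poison the reward signal.
    pg_loss = -(token_logp_clean * advantages).mean()

    # Token-level KL of each corrupted view against the (detached) clean distribution;
    # the most divergent view is kept as the worst-case augmentation.
    p_clean = logp_clean.detach().exp()
    kl_per_view = []
    for corrupted_logits in corrupted_logits_list:
        logp_corr = F.log_softmax(corrupted_logits, dim=-1)
        kl_t = (p_clean * (logp_clean.detach() - logp_corr)).sum(-1)                  # [T]
        kl_per_view.append(kl_t.mean())
    worst_case_kl = torch.stack(kl_per_view).max()

    # Auxiliary policy gradient anchored to clean-image advantages but applied to the
    # corrupted-view log-probs, keeping a reliable reward signal under regularization.
    aux_terms = []
    for corrupted_logits in corrupted_logits_list:
        logp_corr = F.log_softmax(corrupted_logits, dim=-1)
        token_logp_corr = logp_corr.gather(-1, trajectory_tokens.unsqueeze(-1)).squeeze(-1)
        aux_terms.append(-(token_logp_corr * advantages).mean())
    aux_loss = torch.stack(aux_terms).mean()

    # Correctness-conditioned regularization: invariance (and, in this sketch, the
    # auxiliary term) is enforced only when the clean trajectory succeeded.
    mask = 1.0 if reward_correct else 0.0
    return pg_loss + mask * (beta * worst_case_kl + lambda_aux * aux_loss)

# Toy call with random tensors (T = 8 tokens, vocabulary of 100):
T, V = 8, 100
loss = roma_loss(torch.randn(T, V), [torch.randn(T, V) for _ in range(3)],
                 torch.randint(0, V, (T,)), torch.randn(T), reward_correct=True)
```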

What carries the argument

The ROMA framework, which alters RL optimization through dual-forward-pass teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization to reinforce reasoning against degraded visual inputs.

Load-bearing premise

That the dual-forward-pass strategy with teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization together prevent reward poisoning and policy collapse in critic-free RL fine-tuning of autoregressive MLLMs without introducing new failure modes or reducing generalization.

What would settle it

A controlled experiment on Qwen3-VL or a similar model: if applying ROMA during RL fine-tuning yields no accuracy gain on corrupted test images from the seven benchmarks relative to standard GRPO, or if clean-image accuracy drops below the baseline, the core claim fails.

Figures

Figures reproduced from arXiv: 2605.09262 by Dian Yu, Haitao Mi, Haolin Liu, Leoweiliang, Pratap Tokekar, Rui Liu, Runpeng Dai, Tong Zheng, Yucheng Shi.

Figure 1: Overview of ROMA. A standard RL rollout on the clean input yields a trajectory and reward defining the main RL objective. The trajectory is then re-evaluated under perturbations via two branches: a worst-case invariance branch applying a token-level KL penalty against the most divergent of multiple degraded views, gated by a correctness mask so it fires only on successful trajectories; and an auxiliary pol…

Figure 2: Robustness under increasing visual corruption severity. We report accuracy from Clean to Level 3 (severe) for seen and unseen degradations. ROMA consistently achieves higher accuracy at severe corruption levels and exhibits smaller performance degradation compared to the base model and GRPO.

Figure 3: Qualitative examples of visual degradations. The perturbations are categorized into seen degradations (e.g., Gaussian noise, Gaussian blur, JPEG compression, and resolution downscaling), which simulate common transmission artifacts encountered during training, and unseen degradations (e.g., motion blur, salt-and-pepper noise, speckle noise, posterization, and pixelation), which test out-of-distribution (OOD) …
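
The seen degradations named here are concrete enough to sketch. Below is a minimal illustration of the four of them using Pillow and NumPy; the severity parameters are arbitrary placeholders, since the paper's exact corruption settings and severity levels are not reproduced on this page.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    # Additive Gaussian pixel noise, clipped back to the valid 8-bit range.
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

def jpeg_compression(img: Image.Image, quality: int = 15) -> Image.Image:
    # Round-trip through an in-memory JPEG to introduce compression artifacts.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def resolution_downscale(img: Image.Image, factor: int = 4) -> Image.Image:
    # Downscale, then upscale back to the original size to discard detail.
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

# Example (hypothetical file path): the degraded views that would feed the consistency branch.
# clean = Image.open("example.png")
# views = [f(clean) for f in (gaussian_noise, gaussian_blur, jpeg_compression, resolution_downscale)]
```
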
original abstract

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ROMA, a novel RL fine-tuning framework for MLLMs that improves robustness to visual degradations. It uses a dual forward pass with teacher forcing on clean trajectories for corrupted inputs, a token-level surrogate KL penalty, an auxiliary policy gradient loss anchored to clean advantages, and correctness-conditioned regularization. Results claim a +2.4% improvement on seen corruptions and +2.3% on unseen corruptions over the GRPO baseline on Qwen3-VL 4B/8B across seven benchmarks, while matching clean accuracy.

Significance. If the results are verified, this would represent a meaningful advance in making MLLM reasoning more reliable under real-world visual conditions without sacrificing performance on clean inputs. The approach avoids common pitfalls like reward poisoning in critic-free RL settings for autoregressive models, which is a practical contribution. The gains on both seen and unseen corruptions indicate potential for broad applicability.

major comments (2)
  1. The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.
  2. The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.
minor comments (1)
  1. The term 'worst-case augmentation' in the surrogate KL penalty description could be clarified with a specific definition or reference to how it is computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment below with clarifications on the method and results, and we have revised the manuscript to incorporate additional details and analyses as appropriate.

point-by-point responses
  1. Referee: The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.

    Authors: We appreciate the referee noting the need for verifiability. The abstract serves as a concise summary of key outcomes, while the full manuscript details the experimental protocols in Section 4 (including the seven benchmarks, Qwen3-VL 4B/8B models, seen/unseen corruption types, and evaluation procedures), hyperparameter choices and training configurations in Section 4.2 and Appendix A, and ablation studies in Section 5.3. To directly address statistical significance, we have added multi-seed results with standard deviations and t-test p-values (p < 0.05) in the revised tables and text, confirming the reported gains over GRPO. revision: partial

  2. Referee: The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.

    Authors: We thank the referee for this precise analysis of the optimization dynamics. While policy gradients are computed from clean trajectories (to prevent reward poisoning from degraded rollouts), the dual-forward-pass explicitly evaluates corrupted inputs via teacher forcing against those trajectories. This enables the token-level surrogate KL penalty to enforce consistency under degradation and the correctness-conditioned regularization to apply only on successful trajectories, directly reinforcing robustness during training. The auxiliary loss preserves stable clean advantages without collapsing the policy. To substantiate the impact on reasoning under corruption, we have added a new analysis subsection (5.4) with qualitative examples of reasoning chains on corrupted inputs and quantitative metrics (e.g., step-wise correctness and coherence scores), showing ROMA maintains more reliable chains than GRPO on both seen and unseen degradations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent benchmark comparisons

full rationale

The paper proposes ROMA, an RL fine-tuning framework for MLLMs using a dual-forward-pass strategy with teacher forcing on clean trajectories, a token-level surrogate KL penalty, an auxiliary policy gradient loss, and correctness-conditioned regularization. No equations, derivations, or self-referential quantities appear in the abstract or the described components. Results are presented as direct empirical comparisons (+2.4% / +2.3% robustness gains over the GRPO baseline on seven benchmarks while matching clean accuracy), which are externally falsifiable and not reduced by construction to fitted inputs or self-citations. There is no derivation chain to audit; the claims rest directly on external benchmark comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method is described at the level of high-level algorithmic components without mathematical formalization or hyperparameter disclosure.

pith-pipeline@v0.9.0 · 5564 in / 1195 out tokens · 37654 ms · 2026-05-12T04:32:41.041170+00:00 · methodology

