pith. machine review for the scientific record.

arxiv: 2605.09262 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.CL

Recognition: no theorem link

Reinforcing Multimodal Reasoning Against Visual Degradation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal large language models · reinforcement learning · visual robustness · fine-tuning · reasoning · image degradation · policy optimization · reward poisoning

The pith

A new RL fine-tuning framework called ROMA makes multimodal reasoning robust to visual degradations like blur and compression while matching clean-image accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that critic-free reinforcement learning on multimodal large language models leaves reasoning policies brittle to real-world image corruptions, and that the obvious fix of injecting degraded inputs during rollouts backfires: perceptual occlusions trigger hallucinated trajectories and poison the reward. ROMA addresses this by altering the optimization process with a dual-forward-pass strategy that evaluates corrupted views against clean trajectories via teacher forcing, plus token-level regularization and auxiliary losses that maintain a stable reward signal. A sympathetic reader would care because practical deployments often involve imperfect visuals from cameras or scans, where current methods can fail even if they perform well in ideal conditions. If the approach holds, it offers a way to train autoregressive MLLMs that reason reliably under visual noise without retraining from scratch or sacrificing baseline performance.

Core claim

ROMA modifies the optimization dynamics of RL fine-tuning for autoregressive MLLMs by using a dual-forward-pass strategy with teacher forcing to evaluate corrupted views against clean-image trajectories, applying a token-level surrogate KL penalty against the worst-case augmentation for distributional consistency, adding an auxiliary policy gradient loss anchored to clean-image advantages to preserve a reliable reward signal, and enforcing correctness-conditioned regularization that restricts invariance to successful trajectories. This combination avoids reward poisoning from perceptual occlusions and prevents policy collapse under regularization, yielding improved robustness on Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks while matching clean-image accuracy.
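
For concreteness, here is a minimal sketch of how these four pieces could combine into a single training loss in a GRPO-style, critic-free setup. The tensor shapes, the KL direction, the stop-gradient on the clean distribution, the weights beta and lambda_aux, and the decision to gate the auxiliary term with the same correctness mask are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def roma_loss(clean_logits, corrupted_logits_list, trajectory_tokens,
              advantages, reward_correct, beta=0.1, lambda_aux=0.5):
    """Illustrative combination of the components named in the core claim.

    clean_logits:          [T, V] logits for the sampled trajectory under the clean image
    corrupted_logits_list: list of [T, V] logits, one per degraded view, obtained by
                           teacher-forcing the same trajectory (no rollouts on degraded inputs)
    trajectory_tokens:     [T] tokens of the rollout sampled on the clean input
    advantages:            [T] clean-image advantages (e.g. group-normalized rewards)
    reward_correct:        whether the clean rollout reached the correct answer
    """
    logp_clean = F.log_softmax(clean_logits, dim=-1)                                  # [T, V]
    token_logp_clean = logp_clean.gather(-1, trajectory_tokens.unsqueeze(-1)).squeeze(-1)

    # Main objective: policy gradient on the clean rollout only, so degraded inputs
    # never generate trajectories and cannot poison the reward signal.
    pg_loss = -(token_logp_clean * advantages).mean()

    # Token-level KL of each corrupted view against the (detached) clean distribution;
    # the most divergent view is kept as the worst-case augmentation.
    p_clean = logp_clean.detach().exp()
    kl_per_view = []
    for corrupted_logits in corrupted_logits_list:
        logp_corr = F.log_softmax(corrupted_logits, dim=-1)
        kl_t = (p_clean * (logp_clean.detach() - logp_corr)).sum(-1)                  # [T]
        kl_per_view.append(kl_t.mean())
    worst_case_kl = torch.stack(kl_per_view).max()

    # Auxiliary policy gradient anchored to clean-image advantages but applied to the
    # corrupted-view log-probs, keeping a reliable reward signal under regularization.
    aux_terms = []
    for corrupted_logits in corrupted_logits_list:
        logp_corr = F.log_softmax(corrupted_logits, dim=-1)
        token_logp_corr = logp_corr.gather(-1, trajectory_tokens.unsqueeze(-1)).squeeze(-1)
        aux_terms.append(-(token_logp_corr * advantages).mean())
    aux_loss = torch.stack(aux_terms).mean()

    # Correctness-conditioned regularization: invariance (and, in this sketch, the
    # auxiliary term) is enforced only when the clean trajectory succeeded.
    mask = 1.0 if reward_correct else 0.0
    return pg_loss + mask * (beta * worst_case_kl + lambda_aux * aux_loss)

# Toy call with random tensors (T = 8 tokens, vocabulary of 100):
T, V = 8, 100
loss = roma_loss(torch.randn(T, V), [torch.randn(T, V) for _ in range(3)],
                 torch.randint(0, V, (T,)), torch.randn(T), reward_correct=True)
```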

What carries the argument

The ROMA framework, which alters RL optimization through dual-forward-pass teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization to reinforce reasoning against degraded visual inputs.

Load-bearing premise

That the dual-forward-pass strategy with teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization together prevent reward poisoning and policy collapse in critic-free RL fine-tuning of autoregressive MLLMs without introducing new failure modes or reducing generalization.

What would settle it

A controlled experiment on Qwen3-VL or a similar model: if applying ROMA during RL fine-tuning yields no accuracy gain on corrupted test images from the seven benchmarks relative to standard GRPO, or if clean-image accuracy drops below the baseline, the core claim fails.

Figures

Figures reproduced from arXiv: 2605.09262 by Dian Yu, Haitao Mi, Haolin Liu, Leoweiliang, Pratap Tokekar, Rui Liu, Runpeng Dai, Tong Zheng, Yucheng Shi.

Figure 1: Overview of ROMA. A standard RL rollout on the clean input yields a trajectory and reward defining the main RL objective. The trajectory is then re-evaluated under perturbations via two branches: a worst-case invariance branch applying a token-level KL penalty against the most divergent of multiple degraded views, gated by a correctness mask so it fires only on successful trajectories; and an auxiliary pol…

Figure 2: Robustness under increasing visual corruption severity. We report accuracy from Clean to Level 3 (severe) for seen and unseen degradations. ROMA consistently achieves higher accuracy at severe corruption levels and exhibits smaller performance degradation compared to the base model and GRPO.

Figure 3: Qualitative examples of visual degradations. The perturbations are categorized into seen degradations (e.g., Gaussian noise, Gaussian blur, JPEG compression, and resolution downscaling), which simulate common transmission artifacts encountered during training, and unseen degradations (e.g., motion blur, salt-and-pepper noise, speckle noise, posterization, and pixelation), which test out-of-distribution (OOD) …
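
The seen degradations named here are concrete enough to sketch. Below is a minimal illustration of the four of them using Pillow and NumPy; the severity parameters are arbitrary placeholders, since the paper's exact corruption settings and severity levels are not reproduced on this page.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    # Additive Gaussian pixel noise, clipped back to the valid 8-bit range.
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

def jpeg_compression(img: Image.Image, quality: int = 15) -> Image.Image:
    # Round-trip through an in-memory JPEG to introduce compression artifacts.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def resolution_downscale(img: Image.Image, factor: int = 4) -> Image.Image:
    # Downscale, then upscale back to the original size to discard detail.
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

# Example (hypothetical file path): the degraded views that would feed the consistency branch.
# clean = Image.open("example.png")
# views = [f(clean) for f in (gaussian_noise, gaussian_blur, jpeg_compression, resolution_downscale)]
```
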
original abstract

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ROMA, a novel RL fine-tuning framework for MLLMs that improves robustness to visual degradations. It uses a dual forward pass with teacher forcing on clean trajectories for corrupted inputs, a token-level surrogate KL penalty, an auxiliary policy gradient loss anchored to clean advantages, and correctness-conditioned regularization. Results claim a +2.4% improvement on seen corruptions and +2.3% on unseen corruptions over the GRPO baseline on Qwen3-VL 4B/8B across seven benchmarks, while matching clean accuracy.

Significance. If the results are verified, this would represent a meaningful advance in making MLLM reasoning more reliable under real-world visual conditions without sacrificing performance on clean inputs. The approach avoids common pitfalls like reward poisoning in critic-free RL settings for autoregressive models, which is a practical contribution. The gains on both seen and unseen corruptions indicate potential for broad applicability.

major comments (2)
  1. The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.
  2. The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.
minor comments (1)
  1. The term 'worst-case augmentation' in the surrogate KL penalty description could be clarified with a specific definition or reference to how it is computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment below with clarifications on the method and results, and we have revised the manuscript to incorporate additional details and analyses as appropriate.

point-by-point responses
  1. Referee: The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.

    Authors: We appreciate the referee noting the need for verifiability. The abstract serves as a concise summary of key outcomes, while the full manuscript details the experimental protocols in Section 4 (including the seven benchmarks, Qwen3-VL 4B/8B models, seen/unseen corruption types, and evaluation procedures), hyperparameter choices and training configurations in Section 4.2 and Appendix A, and ablation studies in Section 5.3. To directly address statistical significance, we have added multi-seed results with standard deviations and t-test p-values (p < 0.05) in the revised tables and text, confirming the reported gains over GRPO. revision: partial

  2. Referee: The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.

    Authors: We thank the referee for this precise analysis of the optimization dynamics. While policy gradients are computed from clean trajectories (to prevent reward poisoning from degraded rollouts), the dual-forward-pass explicitly evaluates corrupted inputs via teacher forcing against those trajectories. This enables the token-level surrogate KL penalty to enforce consistency under degradation and the correctness-conditioned regularization to apply only on successful trajectories, directly reinforcing robustness during training. The auxiliary loss preserves stable clean advantages without collapsing the policy. To substantiate the impact on reasoning under corruption, we have added a new analysis subsection (5.4) with qualitative examples of reasoning chains on corrupted inputs and quantitative metrics (e.g., step-wise correctness and coherence scores), showing ROMA maintains more reliable chains than GRPO on both seen and unseen degradations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent benchmark comparisons

full rationale

The paper proposes ROMA, an RL fine-tuning framework for MLLMs using a dual-forward-pass strategy with teacher forcing on clean trajectories, a token-level surrogate KL penalty, an auxiliary policy gradient loss, and correctness-conditioned regularization. No equations, derivations, or self-referential quantities appear in the abstract or the described components. Results are presented as direct empirical comparisons (+2.4% / +2.3% robustness gains over the GRPO baseline on seven benchmarks while matching clean accuracy), which are externally falsifiable and not reduced by construction to fitted inputs or self-citations. There is no derivation chain to audit; the claims rest directly on external benchmark comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method is described at the level of high-level algorithmic components without mathematical formalization or hyperparameter disclosure.

pith-pipeline@v0.9.0 · 5564 in / 1195 out tokens · 37654 ms · 2026-05-12T04:32:41.041170+00:00 · methodology

