Recognition: no theorem link
Reinforcing Multimodal Reasoning Against Visual Degradation
Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3
The pith
A new RL fine-tuning framework called ROMA makes multimodal reasoning robust to visual degradations like blur and compression while matching clean-image accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ROMA modifies the optimization dynamics of RL fine-tuning for autoregressive MLLMs by using a dual-forward-pass strategy with teacher forcing to evaluate corrupted views against clean-image trajectories, applying a token-level surrogate KL penalty against the worst-case augmentation for distributional consistency, adding an auxiliary policy gradient loss anchored to clean-image advantages to preserve a reliable reward signal, and enforcing correctness-conditioned regularization that restricts invariance to successful trajectories. This combination avoids reward poisoning from perceptual occlusions and prevents policy collapse under regularization, resulting in improved robustness on Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks (+2.4% on seen and +2.3% on unseen corruptions over GRPO) while matching clean-image accuracy.
What carries the argument
The ROMA framework, which alters RL optimization through dual-forward-pass teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization to reinforce reasoning against degraded visual inputs.
Load-bearing premise
That the dual-forward-pass strategy with teacher forcing, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization together prevent reward poisoning and policy collapse in critic-free RL fine-tuning of autoregressive MLLMs without introducing new failure modes or reducing generalization.
What would settle it
A controlled comparison on Qwen3-VL or a similar model, pitting ROMA-based RL fine-tuning against standard GRPO: the claim would be refuted if ROMA yielded no accuracy gain on corrupted test images from the seven benchmarks, or if its clean-image accuracy dropped below the baseline.
Original abstract
Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
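The abstract's ingredients can be read as one combined loss: a clean-anchored policy-gradient term plus a correctness-conditioned consistency penalty against the worst-case corrupted view. The sketch below is illustrative only, not the paper's implementation; the shapes, the KL direction, and the weight `beta` are assumptions.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def roma_style_loss(clean_logits, corrupted_logits_list, clean_token_logprobs,
                    advantages, correct_mask, beta=0.1):
    """Illustrative combination of the described terms.
    Shapes: B = batch, T = tokens of the clean trajectory, V = vocab."""
    clean_logp = log_softmax(clean_logits)                       # (B, T, V)
    per_view_kl = []
    for corr_logits in corrupted_logits_list:                    # teacher-forced views
        corr_logp = log_softmax(corr_logits)
        # token-level KL(clean || corrupted), summed over vocab then tokens
        kl = (np.exp(clean_logp) * (clean_logp - corr_logp)).sum(-1)
        per_view_kl.append(kl.sum(-1))                           # (B,)
    worst_kl = np.stack(per_view_kl).max(axis=0)                 # worst-case augmentation
    # correctness-conditioned: only successful trajectories are regularized
    consistency = (worst_kl * correct_mask).sum() / max(correct_mask.sum(), 1.0)
    # auxiliary policy-gradient term anchored to clean-image advantages
    pg = -(advantages[:, None] * clean_token_logprobs).mean()
    return pg + beta * consistency
```

When `correct_mask` is all zero (no successful trajectories), the consistency term vanishes and only the clean-anchored policy-gradient term remains, matching the abstract's collapse-prevention rationale.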
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ROMA, a novel RL fine-tuning framework for MLLMs to improve robustness to visual degradations. It uses a dual-forward-pass with teacher forcing on clean trajectories for corrupted inputs, token-level surrogate KL penalty, auxiliary policy gradient loss anchored to clean advantages, and correctness-conditioned regularization. Results claim +2.4% improvement on seen corruptions and +2.3% on unseen over GRPO baseline on Qwen3-VL 4B/8B across seven benchmarks, while matching clean accuracy.
Significance. If the results are verified, this would represent a meaningful advance in making MLLM reasoning more reliable under real-world visual conditions without sacrificing performance on clean inputs. The approach avoids common pitfalls like reward poisoning in critic-free RL settings for autoregressive models, which is a practical contribution. The gains on both seen and unseen corruptions indicate potential for broad applicability.
Major comments (2)
- The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.
- The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.
Minor comments (1)
- The term 'worst-case augmentation' in the surrogate KL penalty description could be clarified with a specific definition or reference to how it is computed.
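One plausible reading of "worst-case augmentation" is selecting, from a pool of candidate corruptions, the view whose teacher-forced token distribution diverges most from the clean one. The sketch below encodes that reading; it is an assumption, and the paper may define or compute the quantity differently.

```python
import numpy as np

def pick_worst_case(clean_probs, candidate_probs, eps=1e-12):
    """Index of the candidate view with the largest mean token-level
    KL(clean || candidate). clean_probs: (T, V); candidates: list of (T, V)."""
    scores = []
    for p in candidate_probs:
        kl = (clean_probs * (np.log(clean_probs + eps) - np.log(p + eps))).sum(-1)
        scores.append(kl.mean())
    return int(np.argmax(scores))
```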
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our work. We address each major comment below with clarifications on the method and results, and we have revised the manuscript to incorporate additional details and analyses as appropriate.
Point-by-point responses
-
Referee: The abstract reports specific percentage gains on named models and benchmarks, but provides no details on experimental protocols, statistical significance, hyperparameter choices, or ablation studies; the central claim therefore rests on unverified implementation specifics.
Authors: We appreciate the referee noting the need for verifiability. The abstract serves as a concise summary of key outcomes, while the full manuscript details the experimental protocols in Section 4 (including the seven benchmarks, Qwen3-VL 4B/8B models, seen/unseen corruption types, and evaluation procedures), hyperparameter choices and training configurations in Section 4.2 and Appendix A, and ablation studies in Section 5.3. To directly address statistical significance, we have added multi-seed results with standard deviations and t-test p-values (p < 0.05) in the revised tables and text, confirming the reported gains over GRPO. revision: partial
-
Referee: The dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. Consequently, policy gradient updates derive exclusively from clean-image trajectories. This creates a potential gap: the reported robustness gains on degraded test inputs may not reflect reinforced reasoning under visual degradation but rather distributional consistency or regularization effects on clean data. Additional analysis of reasoning chains on corrupted inputs is needed to substantiate the central claim.
Authors: We thank the referee for this precise analysis of the optimization dynamics. While policy gradients are computed from clean trajectories (to prevent reward poisoning from degraded rollouts), the dual-forward-pass explicitly evaluates corrupted inputs via teacher forcing against those trajectories. This enables the token-level surrogate KL penalty to enforce consistency under degradation and the correctness-conditioned regularization to apply only on successful trajectories, directly reinforcing robustness during training. The auxiliary loss preserves stable clean advantages without collapsing the policy. To substantiate the impact on reasoning under corruption, we have added a new analysis subsection (5.4) with qualitative examples of reasoning chains on corrupted inputs and quantitative metrics (e.g., step-wise correctness and coherence scores), showing ROMA maintains more reliable chains than GRPO on both seen and unseen degradations. revision: yes
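The mechanism both sides discuss, scoring the clean rollout's tokens under a corrupted forward pass rather than sampling anew, reduces to a log-probability gather. A minimal sketch, assuming logits are already available from a forward pass on the corrupted image:

```python
import numpy as np

def teacher_forced_logprobs(logits, trajectory_tokens):
    """Teacher forcing as described in the rebuttal: score the tokens of a
    trajectory sampled on the *clean* image under next-token logits produced
    from a corrupted view, instead of rolling out anew on the degraded input.
    logits: (T, V) from the corrupted forward pass; trajectory_tokens: (T,)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return logp[np.arange(len(trajectory_tokens)), trajectory_tokens]
```

Because no sampling occurs on the corrupted input, degraded views cannot contribute hallucinated trajectories to the reward signal, which is the reward-poisoning mechanism the referee and authors both reference.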
Circularity Check
No circularity: empirical method with independent benchmark comparisons
Full rationale
The paper proposes ROMA, an RL fine-tuning framework for MLLMs using a dual-forward-pass strategy with teacher forcing on clean trajectories, token-level surrogate KL penalty, auxiliary policy gradient loss, and correctness-conditioned regularization. No equations, derivations, or self-referential quantities appear in the abstract or described components. Results are presented as direct empirical comparisons (+2.4% / +2.3% robustness gains over GRPO baseline on seven benchmarks while matching clean accuracy), which are externally falsifiable and not reduced to fitted inputs or self-citations by construction. The derivation chain is absent; the work is self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
-
[4]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024. URL https://arxiv.org/abs/2403.20330
-
[5]
Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, and Dong Yu. CDE: Curiosity-driven exploration for efficient reinforcement learning in large language models, 2025. URL https://arxiv.org/abs/2509.09675
-
[6]
Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open-VLThinker: Complex vision-language reasoning via iterative SFT-RL cycles, 2025. URL https://arxiv.org/abs/2503.17352
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[8]
Generalization in reinforcement learning by soft data augmentation
Nicklas Hansen and Xiaolong Wang. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE, 2021
-
[9]
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019
-
[10]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025
-
[11]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024
-
[12]
Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020
-
[13]
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652, 2025
-
[14]
Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, and Dong Yu. Save the good prefix: Precise error penalization via process-supervised RL to enhance LLM reasoning. arXiv preprint arXiv:2601.18984, 2026
-
[15]
Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, and Dong Yu. Stable and efficient single-rollout RL for multimodal reasoning. arXiv preprint arXiv:2512.18215, 2025
-
[16]
Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, et al. Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning. arXiv preprint arXiv:2510.01444, 2025
-
[17]
Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. NoisyRollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055, 2025
-
[18]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023
-
[19]
ReFT: Reasoning with reinforced fine-tuning, 2024
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024
-
[20]
Guozheng Ma, Zhen Wang, Zhecheng Yuan, Xueqian Wang, Bo Yuan, and Dacheng Tao. A comprehensive survey of data augmentation in visual reinforcement learning. International Journal of Computer Vision, 133(10):7368–7405, 2025
-
[21]
ChartQA: A benchmark for question answering about charts with visual and logical reasoning, 2022
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022
-
[22]
Alhassan Mumuni, Fuseini Mumuni, and Nana Kobina Gerrar. A survey of synthetic data augmentation methods in machine vision. Machine Intelligence Research, 21(5):831–869, 2024
-
[23]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022
-
[24]
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025
-
[25]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024
-
[26]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
-
[27]
Roberta Raileanu, Max Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus. Automatic data augmentation for generalization in deep reinforcement learning. arXiv preprint arXiv:2006.12862, 2020
-
[28]
Fawaz Sammani, Boris Joukovsky, and Nikos Deligiannis. Visualizing and understanding contrastive learning. IEEE Transactions on Image Processing, 33:541–555, 2023
-
[29]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
-
[30]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[31]
Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. VisualPuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025
-
[32]
Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025
-
[33]
Qwen2.5: A party of foundation models, September 2024
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
-
[34]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025
-
[35]
Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025
-
[36]
xAI. Grok-1.5 Vision Preview, 2024. URL https://x.ai/news/grok-1.5v
-
[37]
Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024
-
[38]
Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025
-
[39]
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, et al. R1-ShareVL: Incentivizing reasoning capability of multimodal large language models via Share-GRPO. arXiv preprint arXiv:2505.16673, 2025
-
[40]
Image augmentation is all you need: Regularizing deep reinforcement learning from pixels
Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International conference on learning representations, 2021
-
[41]
Parallel-R1: Towards parallel thinking via reinforcement learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, et al. Parallel-R1: Towards parallel thinking via reinforcement learning. arXiv preprint arXiv:2509.07980, 2025
-
[42]
EasyR1: An efficient, scalable, multi-modality RL training framework, 2025
Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. EasyR1: An efficient, scalable, multi-modality RL training framework, 2025. URL https://github.com/hiyouga/EasyR1
-
[43]
Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, and Xiang Bai. Shuffle-R1: Efficient RL framework for multimodal large language models via data-centric dynamic shuffle. arXiv preprint arXiv:2508.05612, 2025