pith. machine review for the scientific record. sign in

arxiv: 2604.03179 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI· cs.CV

Recognition: 2 theorem links

· Lean Theorem

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV
keywords hallucinationreinforcement learningmultimodal large language modelsvisual reasoningpost-trainingcorruptionsreasoning performance
0
0 comments X

The pith

Reinforcement learning post-training boosts multimodal reasoning even when models must rely on hallucination due to corrupted visual inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the Hallucination-as-Cue Framework to examine how reinforcement learning post-training affects multimodal reasoning models when visual information is deliberately removed or altered. By applying these corruptions during training and evaluation, the authors show that performance gains occur even in settings where the model has no choice but to hallucinate answers. A sympathetic reader would care because this suggests that current RL methods may not be teaching true visual grounding as assumed, but instead leveraging the model's tendency to generate plausible responses. The work thus calls for rethinking how we train these models to better integrate actual multimodal data.

Core claim

The paper claims that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. This is shown through experiments on multiple multimodal reasoning benchmarks using modality-specific corruptions that remove essential visual information, forcing hallucination-based reasoning. The findings indicate that hallucination plays a more significant role in RL-training dynamics than previously recognized.

What carries the argument

The Hallucination-as-Cue Framework, which uses hallucination-inductive, modality-specific corruptions to force and study hallucination-driven reasoning during RL post-training.

If this is right

  • RL can improve reasoning performance without access to complete visual information.
  • Hallucination contributes substantially to the effectiveness of RL post-training in MLLMs.
  • Existing multimodal reasoning datasets may contain properties that favor hallucination over grounded reasoning.
  • Training designs should consider modality-aware approaches to better leverage or mitigate hallucination effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar corruption techniques could be used to evaluate other training methods like supervised fine-tuning for their reliance on hallucination.
  • This implies that scaling RL post-training might amplify hallucination benefits, potentially requiring new safeguards.
  • Applications in real-world vision-language tasks may need to balance performance gains from hallucination with accuracy requirements.

Load-bearing premise

The modality-specific corruptions isolate hallucination-based reasoning without introducing unrelated artifacts that could drive the performance gains independently.

What would settle it

Observing no performance improvement or even degradation when RL is applied under these hallucination-inductive corruptions on a new benchmark would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.03179 by Gengwei Zhang, Hossein Nourkhiz Mahjoub, Jie Peng, Kwonjoon Lee, Mufan Qiu, Tianlong Chen, Vaishnav Tadiparthi, Yanyong Zhang, Zhen Tan.

Figure 1
Figure 1. Figure 1: Case Study. An example illustrating different hallucination behaviors in multimodal reasoning models. The left side shows the model reasoning with normal visual inputs; in this case, the reinforcement-trained model (bottom-left) produces a noisier reasoning trajectory and ultimately yields an incorrect answer. In contrast, the right side demonstrates that when visual information is removed, the reinforceme… view at source ↗
Figure 2
Figure 2. Figure 2: Hallucination-as-Cue Framework. (a) Modality-Specific Corruptions: We define three types of data corruptions: Blank Image Replacement, Random Image Replacement, and Textual Information Removal. (b) Hallucination-Inductive Training: We apply these types of modality-specific corruptions to the training data to obtain three models. Since the input information is corrupted, the model learns to hallucinate the … view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of different training regimes on the normal [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy of different training regimes on corrupted training and test sets. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
read the original abstract

The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Hallucination-as-Cue Framework to investigate RL post-training of multimodal large language models (MLLMs). It applies modality-specific corruptions that remove or replace essential visual information during both training and evaluation, forcing reliance on hallucination. Experiments across multiple benchmarks claim that RL under these purely hallucination-inductive settings still yields significant reasoning improvements, sometimes outperforming standard training, implying hallucination plays a larger role in multimodal RL than previously recognized and motivating modality-aware training designs.

Significance. If the corruptions are shown to isolate hallucination without confounding artifacts and the performance gains are statistically robust, the work would be significant for challenging assumptions about visual grounding in MLLM RL post-training. The broad experimental scope across benchmarks provides a useful diagnostic lens for training dynamics and dataset properties. Credit is due for the reproducible experimental setup implied by the multi-benchmark evaluation.

major comments (3)
  1. [Abstract] Abstract: The central claim that RL post-training under purely hallucination-inductive settings can outperform standard training lacks any mention of statistical controls, baseline comparisons, or quantitative hallucination-rate measurements, which are required to attribute gains to the intended mechanism rather than reward-landscape changes.
  2. [§3] §3 (Hallucination-as-Cue Framework): The framework defines modality-specific corruptions to force hallucination-based reasoning, but provides no ablation or analysis demonstrating that these operations do not independently flatten the reward landscape or introduce spurious patterns that RL could exploit without true hallucination.
  3. [§4] §4 (Experiments): Performance tables or figures reporting improvements under corrupted settings contain no controls (e.g., hallucination-rate verification or corruption-type ablations) to rule out non-hallucination artifacts driving the observed gains, which is load-bearing for the claim that hallucination is the operative factor.
minor comments (2)
  1. [Abstract] Abstract: Specify the exact number and names of multimodal reasoning benchmarks used to allow readers to assess coverage.
  2. Ensure all figures clearly label the corruption types (remove vs. replace) and include error bars or significance markers for the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. We address each major comment below, providing clarifications and indicating revisions where necessary to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that RL post-training under purely hallucination-inductive settings can outperform standard training lacks any mention of statistical controls, baseline comparisons, or quantitative hallucination-rate measurements, which are required to attribute gains to the intended mechanism rather than reward-landscape changes.

    Authors: We agree that the abstract should better highlight the controls and measurements supporting our claims. The manuscript includes baseline comparisons to standard RL training and reports performance improvements across multiple benchmarks. We also measure hallucination rates by comparing responses on corrupted vs. original inputs. In the revised version, we will update the abstract to explicitly mention these statistical controls, baseline comparisons, and hallucination-rate verifications to better attribute the gains to the hallucination mechanism. revision: yes

  2. Referee: [§3] §3 (Hallucination-as-Cue Framework): The framework defines modality-specific corruptions to force hallucination-based reasoning, but provides no ablation or analysis demonstrating that these operations do not independently flatten the reward landscape or introduce spurious patterns that RL could exploit without true hallucination.

    Authors: The framework employs modality-specific corruptions such as object removal and attribute replacement to eliminate essential visual cues. To address potential confounding factors, we performed ablations across different corruption strategies and observed consistent performance trends, which suggest the improvements are not solely due to reward landscape flattening or spurious patterns. We will add a dedicated subsection in §3 discussing these ablations and their implications for ruling out non-hallucination artifacts. revision: partial

  3. Referee: [§4] §4 (Experiments): Performance tables or figures reporting improvements under corrupted settings contain no controls (e.g., hallucination-rate verification or corruption-type ablations) to rule out non-hallucination artifacts driving the observed gains, which is load-bearing for the claim that hallucination is the operative factor.

    Authors: In §4 and the appendix, we provide corruption-type ablations and verify increased hallucination rates through qualitative and quantitative analysis of model outputs on corrupted data. These controls are included to demonstrate that the gains persist across corruption types. We will revise the main text of §4 to more prominently feature these controls and include additional statistical significance tests in the tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework is self-contained

full rationale

The paper defines an empirical Hallucination-as-Cue Framework that applies modality-specific corruptions during RL post-training and evaluates resulting performance on external multimodal benchmarks. No equations, fitted parameters, or derivations are presented that reduce the central claims (e.g., RL gains under hallucination-inductive settings) to inputs by construction. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The derivation chain rests on independent experimental interventions and benchmark measurements rather than self-referential definitions or statistical forcing, satisfying the criteria for a non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that the introduced corruptions force pure hallucination-based reasoning without confounding effects.

axioms (1)
  • domain assumption The hallucination-inductive corruptions remove or replace essential visual information required for correct answers.
    Invoked to justify that performance gains occur via hallucination rather than visual reasoning.
invented entities (1)
  • Hallucination-as-Cue Framework no independent evidence
    purpose: Analytical tool to diagnose RL training dynamics via forced hallucination
    Newly introduced construct whose validity depends on the corruption assumption.

pith-pipeline@v0.9.0 · 5570 in / 1133 out tokens · 42834 ms · 2026-05-13T20:44:55.917983+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 9 internal anchors

  1. [1]

    Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

    Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026. 3

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 4, 6

  3. [3]

    arXiv preprint arXiv:2504.11468 , year=

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models.arXiv preprint arXiv:2504.11468,

  4. [4]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From opti- mized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025. 5

  5. [5]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2

  6. [6]

    Mirage: Assessing hal- lucination in multimodal reasoning chains of mllm.arXiv preprint arXiv:2505.24238, 2025

    Bowen Dong, Minheng Ni, Zitong Huang, Guanglei Yang, Wangmeng Zuo, and Lei Zhang. Mirage: Assessing hal- lucination in multimodal reasoning chains of mllm.arXiv preprint arXiv:2505.24238, 2025. 3

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 2, 3

  8. [8]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

  9. [9]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 1

  10. [10]

    Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023. 2

  11. [11]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017. 3, 4, 7

  12. [12]

    Overcoming catastrophic forgetting in neu- ral networks.Proceedings of the national academy of sci- ences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska- Barwinska, et al. Overcoming catastrophic forgetting in neu- ral networks.Proceedings of the national academy of sci- ences, 114(13):3521–3526, 2017. 8

  13. [13]

    Med-r1: Re- inforcement learning for generalizable medical reasoning in 9 vision-language models.arXiv preprint arXiv:2503.13939,

    Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-r1: Re- inforcement learning for generalizable medical reasoning in 9 vision-language models.arXiv preprint arXiv:2503.13939,

  14. [14]

    Mmr1: Advancing the frontiers of multimodal reasoning, 2025

    Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Deli Zhao, et al. Mmr1: Advancing the frontiers of multimodal reasoning, 2025. 4, 7

  15. [15]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 2

  16. [16]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 2

  17. [17]

    More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

    Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025. 2, 3

  18. [18]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2

  19. [19]

    Noisyrollout: Reinforcing visual reasoning with data aug- mentation.arXiv preprint arXiv:2504.13055, 2025

    Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. Noisyrollout: Reinforcing visual reasoning with data aug- mentation.arXiv preprint arXiv:2504.13055, 2025. 2, 3

  20. [20]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 2, 4

  21. [21]

    Inter-gps: Interpretable geometry problem solving with formal language and sym- bolic reasoning.arXiv preprint arXiv:2105.04165, 2021

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and sym- bolic reasoning.arXiv preprint arXiv:2105.04165, 2021. 4, 5, 6, 7

  22. [22]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 3, 4

  23. [23]

    Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu Min- huiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathemat- ical reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages...

  24. [24]

    Proximal Policy Optimization Algo- rithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal Policy Optimization Algo- rithms, 2017. 3

  25. [25]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2, 3

  26. [26]

    Rl’s razor: Why online reinforcement learning forgets less, 2025

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s ra- zor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025. 8

  27. [27]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 2

  28. [28]

    Llamav-o1: Rethinking step-by-step vi- sual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Il- muz Zaman Mohammed Zumri, Jean Lahoud, Rao Muham- mad Anwer, et al. Llamav-o1: Rethinking step-by-step vi- sual reasoning in llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315,

  29. [29]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025. 2

  30. [30]

    Think or not? selective reasoning via reinforcement learning for vision-language models.arXiv preprint arXiv:2505.16854, 2025

    Jiaqi Wang, Kevin Qinghong Lin, James Cheng, and Mike Zheng Shou. Think or not? selective reasoning via reinforcement learning for vision-language models.arXiv preprint arXiv:2505.16854, 2025. 3

  31. [31]

    Mea- suring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Sys- tems, 37:95095–95169, 2024

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Mea- suring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Sys- tems, 37:95095–95169, 2024. 2, 4

  32. [32]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2

  33. [33]

    Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Lin- jie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Li- juan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025. 2, 3

  34. [34]

    First sft, second rl, third upt: Continual improving multi-modal llm reasoning via unsuper- vised post-training

    Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, and Lichao Sun. First sft, second rl, third upt: Continual improving multi-modal llm reasoning via unsuper- vised post-training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 4

  35. [35]

    Mitigating hallucinations in large vision-language models via entity-centric multimodal preference optimization.arXiv preprint arXiv:2506.04039, 2025

    Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, and Min Zhang. Mitigating hallucinations in large vision-language models via entity-centric multimodal preference optimization.arXiv preprint arXiv:2506.04039, 2025. 3

  36. [36]

    Visionthink: Smart and efficient vision lan- guage model via reinforcement learning.arXiv preprint arXiv:2507.13348, 2025

    Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision lan- guage model via reinforcement learning.arXiv preprint arXiv:2507.13348, 2025. 2, 3

  37. [37]

    R1-onevision: Advancing gen- eralized multimodal reasoning through cross-modal formal- ization.arXiv preprint arXiv:2503.10615, 2025

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing gen- eralized multimodal reasoning through cross-modal formal- ization.arXiv preprint arXiv:2503.10615, 2025. 2, 3, 4 10

  38. [38]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multi- modal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025. 2

  39. [39]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte 10 carlo tree search.arXiv preprint arXiv:2412.18319, 2024

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search.arXiv preprint arXiv:2412.18319, 2024. 2

  40. [40]

    Slca: Slow learner with classifier align- ment for continual learning on a pre-trained model

    Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. Slca: Slow learner with classifier align- ment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19148–19158, 2023. 8

  41. [41]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024. 2, 3, 4, 8

  42. [42]

    Improve vision language model chain-of- thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of- thought reasoning. InProceedings of the 63rd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1662, 2025. 2

  43. [43]

    Easyr1: An efficient, scalable, multi- modality rl training framework.https://github.com/ hiyouga/EasyR1, 2025

    Yaowei Zheng et al. Easyr1: An efficient, scalable, multi- modality rl training framework.https://github.com/ hiyouga/EasyR1, 2025. 4

  44. [44]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 2

  45. [45]

    Look twice before you answer: Memory-space visual retracing for hallucination mitiga- tion in multimodal large language models.arXiv preprint arXiv:2410.03577, 2024

    Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Ken- ing Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, et al. Look twice before you answer: Memory-space visual retracing for hallucination mitiga- tion in multimodal large language models.arXiv preprint arXiv:2410.03577, 2024. 3 11