arxiv: 2510.21122 · v3 · submitted 2025-10-24 · 💻 cs.CV

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Pith reviewed 2026-05-18 04:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords NoisyGRPO · multimodal CoT reasoning · noise injection · Bayesian advantage estimation · reinforcement learning · MLLMs · generalization · hallucination reduction

The pith

Noise injection into visual inputs and Bayesian advantage estimation improve generalization in multimodal chain-of-thought reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes NoisyGRPO to address generalization failures in reinforcement learning applied to chain-of-thought reasoning in multimodal large language models. It perturbs visual inputs with controllable Gaussian noise during training to expand the range of visual scenarios the model explores. It then treats advantage estimation as Bayesian inference, using the injected noise level as a prior and the observed trajectory reward as the likelihood, so the resulting posterior favors trajectories that remain effective without relying on the added noise. Experiments show this combination yields stronger results on CoT quality, general capability, and hallucination benchmarks, with the largest gains appearing in small-scale models such as Qwen2.5-VL 3B.

Core claim

NoisyGRPO improves RL training for MLLMs by (1) perturbing visual inputs with Gaussian noise to encourage exploration across wider visual scenarios and (2) formulating advantage estimation as Bayesian inference in which the injected noise level serves as prior and the observed trajectory reward as likelihood, fusing the two to produce a posterior estimate that guides the model toward visually grounded trajectories rather than those that succeed only under noise.
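
The noise-injection step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `inject_gaussian_noise`, the additive blending rule, the uniform sampling of noise levels, and the [0, 1] pixel range are all assumptions made for the example.

```python
import numpy as np

def inject_gaussian_noise(image, noise_level, rng):
    """Add Gaussian noise to an image with pixel values in [0, 1].

    noise_level lies in [0, 1]; 0 leaves the image unchanged and larger
    values add more noise. The additive rule is an illustrative assumption.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    noise = rng.standard_normal(image.shape)
    noisy = image + noise_level * noise
    # Keep the result a valid image.
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))            # stand-in for a real image
levels = rng.uniform(0.0, 1.0, size=4)   # one sampled noise level per rollout
rollout_inputs = [inject_gaussian_noise(image, s, rng) for s in levels]
```

Each rollout would then be generated from its own perturbed input, and the sampled noise level is kept alongside the rollout so it can later inform the advantage estimate.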

What carries the argument

Bayesian Advantage Estimation, which computes a posterior trajectory advantage by treating the Gaussian noise level as prior and the observed reward as likelihood to select robust, grounded reasoning paths.
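
A toy sketch of how such a posterior advantage could be computed, assuming a Gaussian prior and a Gaussian observation model. The function names, the choice of the negated, group-standardized noise level as the prior mean, and the variances `prior_var` and `obs_var` are hypothetical choices for illustration, not values or definitions confirmed by the paper.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Group-normalize values to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def posterior_advantage(rewards, noise_levels, prior_var=1.0, obs_var=1.0):
    """Fuse a noise-based prior with a reward-based observation.

    Assumed prior: rollouts generated from noisier inputs are expected to
    be worse, so the prior mean is the negated, standardized noise level.
    """
    mu0 = standardize(-np.asarray(noise_levels, dtype=float))
    y = standardize(rewards)
    # Posterior mean of a Gaussian prior combined with a Gaussian observation.
    shrink = obs_var / (prior_var + obs_var)
    return y + shrink * (mu0 - y)

rewards = [1.0, 1.0, 0.0, 0.0]        # two correct and two incorrect rollouts
noise_levels = [0.1, 0.9, 0.1, 0.9]   # low and high noise within each pair
adv = posterior_advantage(rewards, noise_levels)
# adv is approximately [1, 0, 0, -1]
```

Under these assumptions, a correct rollout from a lightly perturbed image receives the highest advantage, while a correct rollout from a heavily perturbed image is pulled toward zero. Plain group-normalized rewards would score the two correct rollouts identically.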

Load-bearing premise

The injected Gaussian noise level can be used directly as a prior whose posterior advantage estimate reliably prefers visually grounded trajectories over those that succeed only under noise.

What would settle it

A controlled experiment showing that removing the Bayesian component or using a different noise prior yields no gain in out-of-distribution CoT generalization on standard benchmarks would falsify the claim that the noise-as-prior Bayesian step is what drives the improvement.

Figures

Figures reproduced from arXiv: 2510.21122 by Jiaxuan Sun, Longtian Qiu, Shan Ning, Xuming He.

Figure 1: Performance and training statistics of three RL methods. GRPO with noise injection refers to GRPO trained with noise-perturbed rollouts. The left plot shows evaluation performance on the MMStar benchmark over training iterations. The middle plot presents the standard deviation of rewards, reflecting the exploration degree of the policy. The right plot shows the accuracy reward, indicating how well the mode…
Figure 2: Overall pipeline of NoisyGRPO. For each image-question pair, we sample noise and inject it into the image. The policy model generates rollouts based on the perturbed inputs, and the reward function evaluates them. We then compute the posterior advantage by combining the noise-based prior with the reward-based observation. • Experimental results show that NoisyGRPO consistently outperforms GRPO across CoT q…
Figure 3: Performance over iteration and training statistics. We report the evaluation results of NoisyGRPO-3B on the MMStar benchmark to demonstrate the comprehensive capabilities of the MLLM. For both Importance Weight and Completion Length, the shaded regions represent the variance across samples.
Figure 4: Performance of Ablation Study. We report the average performance on the MMStar benchmark across the following ablation settings: (a) Ablation of core design choices in NoisyGRPO; (b) Sensitivity to the hyperparameter α controlling observation confidence; (c) Sensitivity to the hyperparameter γ modulating prior variance adaptation; (d) Comparison of different noise distributions used in the noise-injected …
Figure 5: Correlation of correctness and noise level.
Figure 6: Correlation of group normalized correctness and noise level. Alternatively, the posterior mean can be rewritten in the following equivalent form to highlight its interpolation behavior: $\mu_{\text{post}} = y + \frac{\sigma_y^2}{\sigma_0^2 + \sigma_y^2}(\mu_0 - y)$ (16). This form illustrates how the uncertainty and information from both the prior and the observation are combined with weighted contributions. The variances $\sigma_0^2$ and $\sigma_y^2$ play a …
Figure 7: Histogram and Q-Q plot of residual of observation and prior. The Gaussian in red is the Gaussian distribution with identical mean and variance to the residual.
Figure 8: Histogram and Q-Q plot of residual of posterior and observation. The Gaussian in red is the Gaussian distribution with identical mean and variance to the residual. that the inherent differences in sample characteristics and difficulty levels cause the effect of noise injection to vary significantly across samples. Consequently, group normalization is essential to our approach for stabilizing training. B.2 …
Figure 9: Illustration of "Answer Correctness is a Partial Observation". To assess the distributional characteristics of this residual, we visualize it in
Figure 10: Illustration of Training rollouts and trajectory reward.
Figure 11: Illustration of Training rollouts and trajectory reward. complex multi-step reasoning settings like Chain-of-Thought (CoT). To qualitatively illustrate this phenomenon, we provide a concrete example in
Figure 12: Comparison of rollouts generation between vanilla GRPO and NoisyGRPO.
Figure 13: CoT Inference comparison between GRPO and Noisy. Specifically, we examine the differences between two exploration strategies: one that solely relies on temperature sampling to generate multiple rollouts, and another that introduces noise injection to the inputs to diversify the rollouts. As illustrated in

Original abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes NoisyGRPO, a multimodal RL framework for improving Chain-of-Thought reasoning in MLLMs. It adds controllable Gaussian noise to visual inputs to encourage broader exploration and formulates advantage estimation as Bayesian inference, using the injected noise level as prior and trajectory reward as likelihood to compute a posterior advantage that favors visually grounded trajectories. Experiments on CoT quality, general capability, and hallucination benchmarks report substantial gains in generalization and robustness, especially for small-scale models such as Qwen2.5-VL 3B.

Significance. If the Bayesian posterior reliably down-weights noise-dependent successes while preferring grounded trajectories, the method could supply a principled mechanism for improving generalization in RL for vision-language models, a known weakness of standard GRPO-style approaches. The focus on small-scale MLLMs and the reported benchmark gains suggest practical relevance for resource-limited settings, though the absence of mechanistic verification limits the assessed novelty.

major comments (2)
  1. [Bayesian Advantage Estimation] Bayesian Advantage Estimation section: the likelihood model p(reward | noise, trajectory) is never defined and no derivation is supplied showing that the posterior mean or mode systematically prefers visually grounded trajectories over those succeeding only under injected noise. Without this, the central claim that the Bayesian step is 'effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones' cannot be verified, and the method may reduce to ordinary noise-augmented GRPO.
  2. [Experiments] Experiments section: no error bars, no ablation isolating the Bayesian update rule from the noise injection, and no quantitative comparison of posterior advantage versus raw reward are reported. This leaves the attribution of generalization improvements on CoT quality and hallucination benchmarks unsupported.
minor comments (2)
  1. [Abstract] The abstract states empirical improvements without any numerical values, baseline comparisons, or statistical significance; adding these would strengthen the summary.
  2. [Method] Notation for the posterior advantage estimate is introduced without an explicit equation; providing the update rule in closed form would improve clarity.
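
For reference, a closed-form update of the kind requested in the last comment could take the following shape. This is the textbook Gaussian conjugate update, written under the assumption of a Gaussian prior on the advantage $A$ and a Gaussian observation model. The symbols $\mu_0$, $\sigma_0^2$, $y$, and $\sigma_y^2$ follow the notation of Eq. (16) quoted in the Figure 6 excerpt, but the paper's exact likelihood is not reproduced on this page, so the model below is an assumption.

```latex
% Assumed model: Gaussian prior on the advantage A, Gaussian observation y.
A \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad
y \mid A \sim \mathcal{N}(A, \sigma_y^2)

% Posterior mean, in precision-weighted and interpolation form.
\mu_{\text{post}}
  = \frac{\sigma_y^2\,\mu_0 + \sigma_0^2\,y}{\sigma_0^2 + \sigma_y^2}
  = y + \frac{\sigma_y^2}{\sigma_0^2 + \sigma_y^2}\,(\mu_0 - y)

% Posterior variance.
\sigma_{\text{post}}^2 = \frac{\sigma_0^2\,\sigma_y^2}{\sigma_0^2 + \sigma_y^2}
```

When $\sigma_y^2$ is small relative to $\sigma_0^2$, the posterior mean stays close to the observed reward $y$. When it is large, the estimate moves toward the noise-based prior $\mu_0$.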

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments in detail below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our method and results.

  1. Referee: [Bayesian Advantage Estimation] Bayesian Advantage Estimation section: the likelihood model p(reward | noise, trajectory) is never defined and no derivation is supplied showing that the posterior mean or mode systematically prefers visually grounded trajectories over those succeeding only under injected noise. Without this, the central claim that the Bayesian step is 'effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones' cannot be verified, and the method may reduce to ordinary noise-augmented GRPO.

    Authors: We acknowledge that the original manuscript described the Bayesian advantage estimation at a conceptual level without providing an explicit mathematical definition of the likelihood p(reward | noise, trajectory) or a step-by-step derivation. This was an oversight in the presentation. In the revised manuscript, we will expand the Bayesian Advantage Estimation section to include the full specification of the likelihood model, which models the reward as decreasing with higher noise levels for non-grounded trajectories, and derive the posterior advantage as the mean of the posterior distribution. This derivation demonstrates that trajectories succeeding primarily due to high noise receive lower posterior advantage, thereby preferring visually grounded ones. We believe this addresses the concern and distinguishes the approach from standard noise-augmented GRPO. revision: yes

  2. Referee: [Experiments] Experiments section: no error bars, no ablation isolating the Bayesian update rule from the noise injection, and no quantitative comparison of posterior advantage versus raw reward are reported. This leaves the attribution of generalization improvements on CoT quality and hallucination benchmarks unsupported.

    Authors: We agree with the referee that the experimental section would benefit from additional rigor. In the revised version, we will include error bars computed over multiple random seeds for all main results. We will also add an ablation study that isolates the contribution of the Bayesian update by comparing full NoisyGRPO against a variant that uses only noise injection with standard GRPO advantage estimation. Furthermore, we will provide quantitative analysis, such as histograms or tables, comparing the posterior advantage values to raw rewards for selected trajectories to illustrate how the Bayesian step modulates the advantages. These changes will better support the attribution of the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a practical RL framework combining noise-injected visual inputs with a Bayesian reformulation of advantage estimation. The noise level is explicitly chosen by the experimenter and used as a prior, with trajectory reward as likelihood; the resulting posterior is presented as a modeling choice that fuses information rather than a first-principles derivation whose output is forced to equal its inputs by algebraic identity or statistical construction. No equations are shown that reduce the claimed posterior advantage to a monotonic function of the injected noise alone, nor is there a self-citation chain, ansatz smuggling, or renaming of a known result that bears the central generalization claim. Empirical gains are reported on external benchmarks, leaving the method self-contained against those benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on two modeling choices whose justification is not supplied in the abstract: that Gaussian noise constitutes a suitable prior for visual robustness and that the reward likelihood can be treated as conditionally independent of the noise level. The abstract names no hyperparameter values, but the noise variance and any Bayesian hyperparameters are implicit free parameters.

free parameters (1)
  • noise variance
    Controllable Gaussian noise level injected into visual inputs; chosen by experimenter and used directly as prior.
axioms (1)
  • domain assumption Advantage estimation can be formulated as Bayesian inference with noise level as prior and trajectory reward as likelihood.
    Invoked in the Bayesian Advantage Estimation component; no derivation or validation of the likelihood model is given.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  2. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 2 Pith papers · 34 internal anchors

  1. [1]

    Gpt-4 technical report

    OpenAI Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, and et al. Gpt-4 technical report. 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  5. [5]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiao wen Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? ArXiv, abs/2403.20330, 2024

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Hui Deng, Jiaye ...

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. 2025

  8. [8]

    OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement.ArXiv, abs/2503.17352, 2025

  9. [9]

    Promptdet: Towards open-vocabulary detection using uncurated images

    Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Promptdet: Towards open-vocabulary detection using uncurated images. InEuropean Conference on Computer Vision, 2022

  10. [10]

    Sphinx-x: Scaling data and parameters for a family of multi-modal large language models.arXiv preprint arXiv:2402.05935, 2024

    Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, and Yu Qiao. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models.ArXiv, abs/2402.05935, 2024

  11. [11]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.ArXiv, abs/2404.14396, 2024

  12. [12]

    Calip: Zero-shot enhancement of clip with parameter-free attention.arXiv preprint arXiv:2209.14169, 2022

    Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzheng Ma, Xupeng Miao, Xuming He, and Bin Cui. Calip: Zero-shot enhancement of clip with parameter-free attention.arXiv preprint arXiv:2209.14169, 2022

  13. [13]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaoshen Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.ArXiv, abs/2503.06749, 2025

  14. [14]

    GPT-4o System Card

    OpenAI Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, and et al. Gpt-4o system card.ArXiv, abs/2410.21276, 2024. 11

  15. [15]

    Generalization in reinforcement learning with selective noise injection and information bottleneck

    Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. InNeural Information Processing Systems, 2019

  16. [16]

    Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.arXiv preprint arXiv:2502.09621, 2025

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, and Hongsheng Li. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.ArXiv, abs/2502.09621, 2025

  17. [17]

    A Diagram Is Worth A Dozen Images

    Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images.ArXiv, abs/1603.07396, 2016

  18. [18]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.ArXiv, abs/2408.03326, 2024

  19. [19]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.ArXiv, abs/2307.16125, 2023

  20. [20]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  21. [21]

    Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:3424–3439, 2024

    Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:3424–3439, 2024

  22. [22]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, and Yu Jiao Qiao. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. ArXiv, abs/2311.07575, 2023

  23. [23]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.ArXiv, abs/2310.03744, 2023

  24. [24]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  25. [25]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023

  26. [26]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023

  27. [27]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models.ArXiv, abs/2310.02255, 2023

  28. [28]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InThe 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  29. [29]

    2025.doi:10.48550/arXiv.2411.07975

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation.ArXiv, abs/2411.07975, 2024

  30. [30]

    Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. 2025

  31. [31]

    Sha Ning, Longtian Qiu, Yongfei Liu, and Xuming He. Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23507–23517, 2023

  32. [32]

    Vision - openai api.https://platform.openai.com/docs/guides/vision, 2023

    OpenAI. Vision - openai api.https://platform.openai.com/docs/guides/vision, 2023

  33. [33]

    Openai gpt-4o system card, 2024

    OpenAI. Openai gpt-4o system card, 2024. System Card for OpenAI GPT-4o. 12

  34. [34]

    Openai o1 system card, 2024

    OpenAI. Openai o1 system card, 2024. System Card for OpenAI o1

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  36. [36]

    Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms.ArXiv, abs/2408.10433, 2024

    Yassine Ouali, Adrian Bulat, Brais Martínez, and Georgios Tzimiropoulos. Clip-dpo: Vision-language models as a source of preference for fixing hallucinations in lvlms.ArXiv, abs/2408.10433, 2024

  37. [37]

    Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  38. [38]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Yi Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.ArXiv, abs/2503.07536, 2025

  39. [39]

    CoRR , volume =

    Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang. Strengthening multimodal large language model with bootstrapped preference optimization.ArXiv, abs/2403.08730, 2024

  40. [40]

    Mining fine-grained image-text alignment for zero-shot captioning via text-only training.ArXiv, abs/2401.02347, 2024

    Longtian Qiu, Shan Ning, and Xuming He. Mining fine-grained image-text alignment for zero-shot captioning via text-only training.ArXiv, abs/2401.02347, 2024

  41. [41]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.ArXiv, abs/2305.18290, 2023

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, and et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv, abs/2403.05530, 2024

  43. [43]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Conference on Empirical Methods in Natural Language Processing, 2019

  44. [45]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, Mingchuan Zhang, Y . K. Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ArXiv, abs/2402.03300, 2024

  45. [46]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025

  46. [47]

    Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

    Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Li Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. In Conference on Empirical Methods in Natural Language Processing, 2024

  47. [48]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. ArXiv, abs/2405.09818, 2024

  48. [49]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, et al. Kimi k1.5: Scaling reinforcement learning with llms. ArXiv, abs/2501.12599, 2025

  49. [50]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. ArXiv, abs/2311.07397, 2023

  50. [51]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  51. [52]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models. ArXiv, abs/2311.03079, 2023

  52. [53]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiao wen Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and compre...

  53. [54]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation. ArXiv, abs/2410.13848, 2024

  54. [55]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bing-Li Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yu mei You, Kaihong Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts vis...

  55. [56]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. ArXiv, abs/2408.12528, 2024

  56. [57]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qi-An Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm-v: A gpt-4v level mllm on your phone. ArXiv, abs/2408.01800, 2024

  57. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Honglin Yu, Weinan Dai, Yuxuan Song, Xiang Wei, Haodong Zhou, Jingjing Liu, ...

  58. [59]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  59. [60]

    Mavis: Mathematical visual instruction tuning

    Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, and Hongsheng Li. Mavis: Mathematical visual instruction tuning. ArXiv, abs/2407.08739, 2024

  60. [62]

    Mm-rlhf: The next step forward in multimodal llm alignment

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, and Tien-Ping Tan. Mm-rlhf: The next step forward in multimodal llm alignment. ArXiv, abs/2502.10391, 2025

  61. [63]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Jun Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, and Tien-Ping Tan. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? ArXiv, abs/2408.13257, 2024

  62. [64]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022

  63. [65]

    Mova: Adapting mixture of vision experts to multimodal context

    Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, and Yu Liu. Mova: Adapting mixture of vision experts to multimodal context. ArXiv, abs/2404.13046, 2024

A Preliminary

This section formalizes the reinforcement learning (RL) framework for post-training optimization of multimodal large language models (MLLMs). We...