NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
Pith reviewed 2026-05-18 04:35 UTC · model grok-4.3
The pith
Noise injection into visual inputs and Bayesian advantage estimation improve generalization in multimodal chain-of-thought reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NoisyGRPO improves RL training for MLLMs by (1) perturbing visual inputs with Gaussian noise to encourage exploration across wider visual scenarios and (2) formulating advantage estimation as Bayesian inference in which the injected noise level serves as prior and the observed trajectory reward as likelihood, fusing the two to produce a posterior estimate that guides the model toward visually grounded trajectories rather than those that succeed only under noise.
What carries the argument
Bayesian Advantage Estimation, which computes a posterior trajectory advantage by treating the Gaussian noise level as prior and the observed reward as likelihood to select robust, grounded reasoning paths.
Load-bearing premise
The injected Gaussian noise level can be used directly as a prior whose posterior advantage estimate reliably prefers visually grounded trajectories over those that succeed only under noise.
What would settle it
A controlled experiment showing that removing the Bayesian component or using a different noise prior yields no gain in out-of-distribution CoT generalization on standard benchmarks would falsify the claim that the noise-as-prior Bayesian step is what drives the improvement.
Original abstract
Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
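As a rough illustration of the noise-injected exploration step described in the abstract, the sketch below adds zero-mean Gaussian noise with a sampled standard deviation to a toy image. It is a minimal sketch under stated assumptions: pixel-space noise, intensities in [0, 1], clipping, and a uniformly sampled noise level are all choices made here for illustration, and the function name `inject_gaussian_noise` is hypothetical. The abstract does not say where or how the noise is applied.

```python
import random

def inject_gaussian_noise(pixels, noise_level, rng):
    """Add zero-mean Gaussian noise (std = noise_level) to pixel values in [0, 1].

    Assumption: noise is added in pixel space and the result is clipped;
    the paper may perturb the visual input differently.
    """
    noisy = []
    for value in pixels:
        perturbed = value + rng.gauss(0.0, noise_level)
        noisy.append(min(1.0, max(0.0, perturbed)))  # keep intensities valid
    return noisy

rng = random.Random(0)
pixels = [rng.random() for _ in range(12)]  # toy stand-in for a flattened image
# Hypothetical: one noise level per rollout, sampled uniformly.
noise_levels = [rng.uniform(0.0, 0.5) for _ in range(4)]
rollout_inputs = [inject_gaussian_noise(pixels, s, rng) for s in noise_levels]
```

Keeping the sampled noise level alongside each rollout matters in this setup, since the abstract says that level later feeds the advantage estimate.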
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes NoisyGRPO, a multimodal RL framework for improving Chain-of-Thought reasoning in MLLMs. It adds controllable Gaussian noise to visual inputs to encourage broader exploration and formulates advantage estimation as Bayesian inference, using the injected noise level as prior and trajectory reward as likelihood to compute a posterior advantage that favors visually grounded trajectories. Experiments on CoT quality, general capability, and hallucination benchmarks report substantial gains in generalization and robustness, especially for small-scale models such as Qwen2.5-VL 3B.
Significance. If the Bayesian posterior reliably down-weights noise-dependent successes while preferring grounded trajectories, the method could supply a principled mechanism for improving generalization in RL for vision-language models, a known weakness of standard GRPO-style approaches. The focus on small-scale MLLMs and the reported benchmark gains suggest practical relevance for resource-limited settings, though the absence of mechanistic verification limits the assessed novelty.
major comments (2)
- [Bayesian Advantage Estimation] The likelihood model p(reward | noise, trajectory) is never defined, and no derivation is supplied showing that the posterior mean or mode systematically prefers visually grounded trajectories over those that succeed only under injected noise. Without this, the central claim that the Bayesian step is 'effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones' cannot be verified, and the method may reduce to ordinary noise-augmented GRPO.
- [Experiments] The paper reports no error bars, no ablation isolating the Bayesian update rule from the noise injection, and no quantitative comparison of posterior advantage versus raw reward. This leaves the attribution of generalization improvements on CoT quality and hallucination benchmarks unsupported.
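For readers unfamiliar with the baseline the referee refers to, the following sketch shows the group-relative advantage commonly used in GRPO-style training: each rollout's reward is normalized by the mean and standard deviation of the rewards for the same prompt. This is an illustration, not the paper's code; the function name `grpo_advantages`, the use of population standard deviation, and the `eps` stabilizer are assumptions made here.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - mean) / (std + eps) over one prompt's rollouts.

    Assumption: population standard deviation; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy binary rewards for four rollouts of the same prompt.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that this baseline uses only the rewards. A noise-augmented variant that keeps this rule unchanged would ignore the injected noise level when scoring trajectories, which is the reduction the referee is worried about.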
minor comments (2)
- [Abstract] The abstract states empirical improvements without any numerical values, baseline comparisons, or statistical significance; adding these would strengthen the summary.
- [Method] Notation for the posterior advantage estimate is introduced without an explicit equation; providing the update rule in closed form would improve clarity.
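For orientation only, the standard conjugate Gaussian update is written out below as one closed form that a posterior advantage estimate could take. It assumes a Gaussian prior on the trajectory value with mean μ_i and variance σ², and a Gaussian likelihood for the observed reward with variance σ_s². Both distributional choices and the reading of the symbols are assumptions made here. The resulting mean has the same form as the posterior-mean expression quoted in the Lean-theorem section of this page, but whether the paper derives it this way is not confirmed by the source.

```latex
\begin{aligned}
p(r_i) &= \mathcal{N}(r_i \mid \mu_i, \sigma^2), \qquad
p(\tilde{r}_i \mid r_i) = \mathcal{N}(\tilde{r}_i \mid r_i, \sigma_s^2), \\
p(r_i \mid \tilde{r}_i) &\propto p(\tilde{r}_i \mid r_i)\, p(r_i)
  \quad\Longrightarrow\quad
  r_i \mid \tilde{r}_i \sim \mathcal{N}(\hat{r}_i, \hat{\sigma}^2), \\
\hat{r}_i &= \frac{\sigma_s^2\, \mu_i + \sigma^2\, \tilde{r}_i}{\sigma^2 + \sigma_s^2}
  = \tilde{r}_i + \frac{\sigma_s^2}{\sigma^2 + \sigma_s^2}\,(\mu_i - \tilde{r}_i), \qquad
\hat{\sigma}^2 = \frac{\sigma^2\, \sigma_s^2}{\sigma^2 + \sigma_s^2}.
\end{aligned}
```

Under these assumptions the posterior mean is a precision-weighted average: the larger σ_s² is relative to σ², the more the estimate is pulled from the observed reward toward μ_i.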
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments in detail below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our method and results.
- Referee: [Bayesian Advantage Estimation] The likelihood model p(reward | noise, trajectory) is never defined, and no derivation is supplied showing that the posterior mean or mode systematically prefers visually grounded trajectories over those that succeed only under injected noise. Without this, the central claim that the Bayesian step is 'effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones' cannot be verified, and the method may reduce to ordinary noise-augmented GRPO.
  Authors: We acknowledge that the original manuscript described the Bayesian advantage estimation at a conceptual level without providing an explicit mathematical definition of the likelihood p(reward | noise, trajectory) or a step-by-step derivation. This was an oversight in the presentation. In the revised manuscript, we will expand the Bayesian Advantage Estimation section to include the full specification of the likelihood model, which models the reward as decreasing with higher noise levels for non-grounded trajectories, and derive the posterior advantage as the mean of the posterior distribution. This derivation demonstrates that trajectories succeeding primarily due to high noise receive lower posterior advantage, so that visually grounded ones are preferred. We believe this addresses the concern and distinguishes the approach from standard noise-augmented GRPO.
  Revision: yes
- Referee: [Experiments] The paper reports no error bars, no ablation isolating the Bayesian update rule from the noise injection, and no quantitative comparison of posterior advantage versus raw reward. This leaves the attribution of generalization improvements on CoT quality and hallucination benchmarks unsupported.
  Authors: We agree with the referee that the experimental section would benefit from additional rigor. In the revised version, we will include error bars computed over multiple random seeds for all main results. We will also add an ablation study that isolates the contribution of the Bayesian update by comparing full NoisyGRPO against a variant that uses only noise injection with standard GRPO advantage estimation. Furthermore, we will provide quantitative analysis, such as histograms or tables, comparing the posterior advantage values to raw rewards for selected trajectories to illustrate how the Bayesian step modulates the advantages. These changes will better support the attribution of the observed improvements.
  Revision: yes
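The error bars promised in the rebuttal could be computed along the following lines. This is a minimal sketch, assuming each configuration is trained with several random seeds and yields one benchmark score per seed. The helper name `summarize_over_seeds` and the variant labels are hypothetical, and the scores are arbitrary placeholders, not results from the paper.

```python
import statistics

def summarize_over_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed benchmark scores."""
    mean = statistics.fmean(scores)
    # Sample std needs at least two seeds; report 0.0 for a single run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder numbers chosen only to exercise the function (NOT real results).
placeholder_runs = {
    "variant_a": [1.0, 2.0, 3.0],
    "variant_b": [2.0, 2.0, 2.0],
}
for name, scores in placeholder_runs.items():
    mean, std = summarize_over_seeds(scores)
    print(f"{name}: {mean:.3f} ± {std:.3f} (n={len(scores)})")
```

Reporting the number of seeds next to each mean and deviation makes it easier to judge whether a gap between the full method and the noise-only ablation exceeds seed-to-seed variation.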
Circularity Check
No significant circularity detected
The paper proposes a practical RL framework combining noise-injected visual inputs with a Bayesian reformulation of advantage estimation. The noise level is explicitly chosen by the experimenter and used as a prior, with trajectory reward as likelihood; the resulting posterior is presented as a modeling choice that fuses information rather than a first-principles derivation whose output is forced to equal its inputs by algebraic identity or statistical construction. No equations are shown that reduce the claimed posterior advantage to a monotonic function of the injected noise alone, nor is there a self-citation chain, ansatz smuggling, or renaming of a known result that bears the central generalization claim. Empirical gains are reported on external benchmarks, leaving the method self-contained against those benchmarks without circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- noise variance
axioms (1)
- domain assumption: Advantage estimation can be formulated as Bayesian inference with noise level as prior and trajectory reward as likelihood.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood... posterior mean $\hat{r}_i = \tilde{r}_i + \frac{\sigma_s^2}{\sigma^2 + \sigma_s^2}(\mu_i - \tilde{r}_i)$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise... Bayesian Advantage Estimation
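The posterior-mean expression quoted in the first entry above can be evaluated numerically to see how it behaves. The sketch below is illustrative only: the function name `posterior_mean` is hypothetical, the parameter names mirror the symbols in the quoted formula, and the quoted passage does not define what σ² and σ_s² denote or how they relate to the injected noise level, so the toy values carry no meaning beyond showing the limiting cases.

```python
def posterior_mean(r_tilde, mu, sigma_sq, sigma_s_sq):
    """Evaluate r_hat = r_tilde + sigma_s_sq / (sigma_sq + sigma_s_sq) * (mu - r_tilde)."""
    weight = sigma_s_sq / (sigma_sq + sigma_s_sq)  # lies in [0, 1) for positive variances
    return r_tilde + weight * (mu - r_tilde)

# Toy numbers only: observed value r_tilde = 1.0, reference value mu = 0.0.
equal_variances = posterior_mean(1.0, 0.0, sigma_sq=1.0, sigma_s_sq=1.0)  # halfway between
tiny_sigma_s = posterior_mean(1.0, 0.0, sigma_sq=1.0, sigma_s_sq=1e-9)    # stays near r_tilde
large_sigma_s = posterior_mean(1.0, 0.0, sigma_sq=1.0, sigma_s_sq=1e9)    # pulled toward mu
```

The expression is a shrinkage rule: as σ_s² grows relative to σ², the estimate moves from the observed value toward μ_i, and as σ_s² shrinks it stays close to the observation.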
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
  WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
- PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
  PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
A Preliminary
This section formalizes the reinforcement learning (RL) framework for post-training optimization of multimodal large language models (MLLMs). We...