pith. machine review for the scientific record.

arxiv: 2509.24251 · v2 · submitted 2025-09-29 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

Latent Visual Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:38 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Latent Visual Reasoning · Multimodal Large Language Models · Visual Question Answering · Autoregressive Reasoning · Visual Embeddings · Chain of Thought

The pith

Multimodal models can perform reasoning steps by autoregressively generating latent visual states that reconstruct key image tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Visual Reasoning as a way to move chain-of-thought beyond language alone. A visual encoder turns images into tokens that share space with the language model; the model is then trained to output latent states whose goal is to rebuild the specific visual tokens needed for the answer. These latent steps are interleaved with ordinary text generation during inference. The result is reported as stronger performance on tasks that demand fine visual detail, such as reaching 71.67 percent on MMVP where the baseline sits at 66.67 percent. Reinforcement learning with an adapted GRPO objective is used to keep the visual and textual parts balanced.
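The interleaving described above can be pictured as a decode loop that emits either a latent visual state or a text token at each step. The sketch below is a toy illustration, not the paper's interface: the step function, the 30 percent latent rate, and the 16-dimensional latent width are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical latent / embedding width

def toy_model_step(context):
    """Stand-in for the LM head: choose whether the next step is a
    latent visual state or a text token (random here, for illustration)."""
    if rng.random() < 0.3:
        return ("latent", rng.standard_normal(D))  # latent visual state
    return ("text", int(rng.integers(0, 1000)))    # text token id

def decode(max_steps=20):
    """Interleave latent visual steps with ordinary text generation.
    Latent states are fed back as context but never surface as text."""
    context, text_out, latent_steps = [], [], 0
    for _ in range(max_steps):
        kind, value = toy_model_step(context)
        context.append((kind, value))
        if kind == "latent":
            latent_steps += 1          # stays in embedding space
        else:
            text_out.append(value)     # surfaces as output text
    return text_out, latent_steps

tokens, n_latent = decode()
```

Every step produces exactly one of the two, so the text output plus the latent count always accounts for the full trajectory length.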

Core claim

The central claim is that training the language model to generate latent states in the joint visual embedding space, with the explicit objective of reconstructing critical visual tokens from the encoder, constitutes autoregressive visual reasoning that improves perception-intensive visual question answering when interleaved with standard text output.

What carries the argument

Autoregressive generation of latent visual states whose training target is reconstruction of selected visual tokens from the shared embedding space.
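The abstract does not name the reconstruction loss, so the sketch below assumes a simple mean-squared error between generated latent states and the selected visual tokens; treat the functional form as a placeholder, not the paper's objective.

```python
import numpy as np

def reconstruction_loss(latent_states, target_tokens):
    """Mean squared error between generated latent states and the
    query-critical visual tokens they are trained to rebuild.
    The MSE form is an assumption; the abstract does not name the loss."""
    latent_states = np.asarray(latent_states, dtype=float)
    target_tokens = np.asarray(target_tokens, dtype=float)
    assert latent_states.shape == target_tokens.shape
    return float(np.mean((latent_states - target_tokens) ** 2))

# Perfect reconstruction drives the loss to zero.
targets = np.random.default_rng(1).standard_normal((4, 16))
```

Under this form, reconstructing the targets exactly gives a loss of zero, and a constant unit offset gives a loss of exactly one.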

If this is right

  • Interleaving visual latent steps with text output produces measurable gains on fine-grained visual question answering.
  • Reinforcement learning via adapted GRPO can be used to balance the frequency of latent visual generation against text generation.
  • The approach reduces dependence on external visual-editing tools by keeping reasoning inside the model's own embedding space.
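GRPO's core move is to normalize each sampled trajectory's reward against its own group. How the paper's adapted objective balances latent against text generation is not specified in the abstract, so the shaping term below is a hypothetical illustration, not the actual reward.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled trajectory's
    reward by the mean and std of its own group (GRPO's core step)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def balanced_reward(task_reward, n_latent, n_text, lam=0.05):
    """Hypothetical shaping: penalize trajectories whose latent/text
    ratio drifts from one-half, to balance LVR against text generation.
    The paper's actual adapted objective is not given in the abstract."""
    ratio = n_latent / max(n_latent + n_text, 1)
    return task_reward - lam * abs(ratio - 0.5)

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

With equal latent and text counts the penalty vanishes, so `balanced_reward(1.0, 5, 5)` returns the task reward unchanged.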

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend naturally to video by treating successive frames as sequences of visual tokens to reconstruct.
  • If the latent states truly carry visual reasoning, similar reconstruction objectives could be added to pure language models for tasks that benefit from imagined visual simulation.
  • A practical test would measure whether the same gains appear when the visual tokens are supplied through text captions instead of direct reconstruction.

Load-bearing premise

That forcing the model to reconstruct chosen visual tokens through latent states produces actual reasoning steps that outperform language-only chain-of-thought or external editing tools.

What would settle it

A controlled experiment showing that removing the visual-token reconstruction loss while keeping the same model size, data, and interleaving schedule yields no gain on MMVP or similar perception benchmarks.
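A back-of-envelope check shows why sample size matters for such an experiment. The benchmark size below is hypothetical (MMVP's item count is not stated here); under an independent-samples two-proportion approximation, a 5-point gap on a few hundred items sits close to the noise floor.

```python
import math

def two_proportion_z(p1, p2, n):
    """z-statistic for comparing two accuracies, each measured on n items
    (independent-samples approximation with a pooled proportion)."""
    p = (p1 + p2) / 2
    se = math.sqrt(2 * p * (1 - p) / n)
    return (p1 - p2) / se

# 71.67% vs 66.67%: with a hypothetical n = 300 items, z comes out
# near 1.3, below the ~1.96 two-sided threshold for 95% confidence.
z = two_proportion_z(0.7167, 0.6667, 300)
```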

read the original abstract

Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Latent Visual Reasoning (LVR), a paradigm for autoregressive reasoning in the visual embedding space of MLLMs. A visual encoder maps images to tokens in a joint space; the LM is trained to generate latent states that reconstruct query-critical visual tokens. These latents are interleaved with text generation, and the GRPO RL algorithm is adapted to balance LVR with textual output. The work reports gains on perception-intensive VQA, including 71.67% on MMVP versus 66.67% for Qwen2.5-VL.

Significance. If the reported gains can be isolated to the latent reconstruction mechanism rather than the RL stage or capacity increases, the approach could meaningfully extend reasoning beyond language-space CoT or external visual tools by operating directly in shared visual embeddings. The GRPO adaptation for latent trajectories is a concrete technical contribution worth exploring further.

major comments (2)
  1. [Abstract] The single reported benchmark improvement (MMVP 71.67% vs 66.67%) provides no baseline details, ablations, or controls to separate the contribution of the latent reconstruction objective from the GRPO training stage, additional capacity, or data differences. This is load-bearing for the central claim that LVR constitutes a new reasoning paradigm.
  2. [Abstract] The mechanism description does not specify how latent states are produced step-by-step as an autoregressive trajectory independent of the final answer, nor whether reconstruction is enforced only during training or also at inference. Without these details the process risks being equivalent to a conditional visual predictor rather than genuine multi-step visual reasoning.
minor comments (1)
  1. [Abstract] The statement that code and model weights will be released later lacks a timeline or repository link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, providing clarifications from the full manuscript and committing to revisions that strengthen the presentation of LVR as a distinct reasoning paradigm.

read point-by-point responses
  1. Referee: [Abstract] The single reported benchmark improvement (MMVP 71.67% vs 66.67%) provides no baseline details, ablations, or controls to separate the contribution of the latent reconstruction objective from the GRPO training stage, additional capacity, or data differences. This is load-bearing for the central claim that LVR constitutes a new reasoning paradigm.

    Authors: We agree that the abstract alone does not isolate contributions. The full manuscript includes direct comparisons to the Qwen2.5-VL base model, an ablation removing the latent reconstruction loss while retaining GRPO, and controls for data and capacity. These show that the largest gains on MMVP and similar perception benchmarks arise specifically from the latent reconstruction objective. We will revise the abstract to reference these controls and point readers to the relevant experimental sections. revision: yes

  2. Referee: [Abstract] The mechanism description does not specify how latent states are produced step-by-step as an autoregressive trajectory independent of the final answer, nor whether reconstruction is enforced only during training or also at inference. Without these details the process risks being equivalent to a conditional visual predictor rather than genuine multi-step visual reasoning.

    Authors: The manuscript (Section 3) specifies that the LM autoregressively generates a variable-length sequence of latent visual states conditioned on prior latents and text tokens, before emitting the final textual answer. Each latent is trained to reconstruct query-critical visual tokens via an auxiliary loss applied only during training; at inference the reconstruction loss is disabled and the model generates the latent trajectory freely as part of its reasoning process. This interleaving produces multi-step visual reasoning trajectories distinct from a single conditional prediction. We will add a concise description of this training-versus-inference distinction to the abstract. revision: yes
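The training-versus-inference distinction the authors describe can be sketched with a hypothetical flag: the auxiliary reconstruction loss exists only in the training branch, while inference generates the latent trajectory freely. The names and the MSE form are illustrative stand-ins, not the manuscript's implementation.

```python
import numpy as np

def lvr_step(latent, target_visual_tokens, training):
    """One latent visual step. The auxiliary reconstruction loss is
    computed only in training; at inference the latent trajectory is
    generated with no reconstruction target (hypothetical sketch)."""
    latent = np.asarray(latent, dtype=float)
    if training:
        aux_loss = float(np.mean((latent - target_visual_tokens) ** 2))
    else:
        aux_loss = None  # reconstruction objective disabled at inference
    return latent, aux_loss
```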

Circularity Check

0 steps flagged

No significant circularity; empirical paradigm introduction without derivations

full rationale

The paper introduces LVR as a training paradigm in which the LM generates latent states to reconstruct selected visual tokens in a shared embedding space, then interleaves this with text generation and applies GRPO RL. No equations, derivations, or first-principles results are presented. Claims rest entirely on empirical benchmark gains (e.g., MMVP improvement over Qwen2.5-VL). No self-definitional loops, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear. The method is self-contained as a proposed architecture evaluated on external tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on standard multimodal encoder assumptions and introduces latent visual states as a new construct, with no independent falsifiable evidence supplied in the abstract.

axioms (1)
  • domain assumption A visual encoder projects images into visual tokens within a joint semantic space shared with the language model.
    This is taken as given from existing MLLM architectures.
invented entities (1)
  • latent states no independent evidence
    purpose: Intermediate representations generated autoregressively to reconstruct key visual tokens during reasoning.
    New construct introduced by the paper to enable visual-space reasoning.

pith-pipeline@v0.9.0 · 5544 in / 1283 out tokens · 37226 ms · 2026-05-15T18:38:00.990162+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  2. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  3. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  4. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  5. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  6. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  7. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  8. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  9. Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...

  10. 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.

  11. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  12. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  13. GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

    cs.CL 2026-04 unverdicted novelty 6.0

    GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.

  14. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  15. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  16. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  17. Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

    cs.CV 2026-04 unverdicted novelty 5.0

    TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...

  18. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  19. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  20. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  21. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  22. Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 18 Pith papers · 12 internal anchors
