hub Canonical reference

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo · 2025 · cs.CV · arXiv 2503.12605

Canonical reference. 100% of citing Pith papers cite this work as background.

30 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 30 citing papers arXiv PDF

abstract

By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12

citation-polarity summary

background 12

representative citing papers

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

Reasoning Structure Matters for Safety Alignment of Reasoning Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

cs.AI · 2026-04-12 · unverdicted · novelty 6.0

EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

cs.CL · 2026-03-10 · unverdicted · novelty 6.0

C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization in multimodal analysis.

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

cs.CV · 2025-09-27 · unverdicted · novelty 6.0

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

cs.CL · 2026-05-31 · unverdicted · novelty 5.0

HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token use than standard CoT on GSM8K and MATH500.

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VisionPulse is a step-wise visual token pruning method for LMMs that retains 5% of tokens per step, shortens reasoning traces by 11.2%, and maintains accuracy.

From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

The work introduces a chart-to-dictionary auxiliary task paired with Program-of-Thoughts prompting to enable zero-shot chart summarization that matches existing methods on semantic and factual metrics.

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

cs.RO · 2026-05-19 · unverdicted · novelty 5.0

SafeAlign-VLA uses counterfactual safety pairing and anchor-based group relative policy optimization to incorporate negative data for safer VLA-based autonomous driving.

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety cs.CR · 2026-04-21 · unverdicted · none · ref 183 · internal anchor
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer