pith. machine review for the scientific record.

arxiv: 2503.12605 · v2 · submitted 2025-03-16 · 💻 cs.CV

Recognition: no theorem link

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal chain-of-thought, MCoT reasoning, multimodal large language models, taxonomy, survey, reasoning paradigms, multimodal applications, challenges

The pith

Multimodal chain-of-thought reasoning receives its first systematic survey and taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compiles scattered research on multimodal chain-of-thought reasoning, where models perform step-by-step logic across text, images, video, audio, 3D, and structured data. It defines core terms, builds a taxonomy that groups methods by modality handling and task type, and reviews how these approaches perform in robotics, healthcare, autonomous driving, and generation tasks. The survey also identifies open challenges and future directions. A reader cares because the structure turns isolated papers into a shared reference that can guide consistent progress toward multimodal AI systems.
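As a hedged illustration of the step-by-step pattern the survey covers, the sketch below shows how a multimodal CoT prompt might interleave an image reference with explicit reasoning steps. The `<image:...>` placeholder tag, the function, and the step template are assumptions for illustration, not a format defined in the paper:

```python
# Hypothetical sketch of a multimodal chain-of-thought (MCoT) prompt.
# The <image:...> placeholder and the three-step template are illustrative
# assumptions, not a format taken from the survey.
def build_mcot_prompt(question: str, image_ref: str) -> str:
    steps = [
        "Step 1: Describe the visual evidence relevant to the question.",
        "Step 2: Reason over that evidence together with the text.",
        "Step 3: State the final answer with a brief justification.",
    ]
    return f"<image:{image_ref}>\nQuestion: {question}\n" + "\n".join(steps)

prompt = build_mcot_prompt("What hazard is ahead?", "frame_001")
```

The same scaffold generalizes to other modalities by swapping the placeholder (e.g., an audio or 3D reference) while keeping the explicit intermediate steps that distinguish MCoT from direct answering.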

Core claim

By extending chain-of-thought reasoning to multimodal contexts, MCoT has produced methods that integrate image, video, speech, audio, 3D, and structured data with large language models and deliver results in real-world applications. This work supplies the first systematic survey: it clarifies foundational concepts and definitions, presents a comprehensive taxonomy of methodologies viewed from multiple perspectives, analyzes them across application scenarios, and offers targeted insights on remaining challenges and research paths aimed at multimodal AGI.

What carries the argument

A comprehensive taxonomy that organizes MCoT methodologies according to reasoning paradigms, modality combinations, and application scenarios.
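One way to picture the three organizing axes named above (reasoning paradigm, modality combination, application scenario) is as a keyed index over surveyed methods. The records, field names, and method names below are hypothetical placeholders, not the survey's actual catalogue:

```python
from collections import defaultdict

# Illustrative sketch: group hypothetical method records under the survey's
# three taxonomy axes. Entries are invented placeholders for demonstration.
def group_by_axes(methods):
    """Index method names by (paradigm, modality tuple, application)."""
    index = defaultdict(list)
    for m in methods:
        key = (m["paradigm"], tuple(sorted(m["modalities"])), m["application"])
        index[key].append(m["name"])
    return dict(index)

surveyed = [
    {"name": "MethodA", "paradigm": "step-wise",
     "modalities": ["text", "image"], "application": "healthcare"},
    {"name": "MethodB", "paradigm": "tree-based",
     "modalities": ["text", "video"], "application": "autonomous driving"},
]
taxonomy = group_by_axes(surveyed)
```

Indexing on all three axes at once is what lets a taxonomy of this shape support cross-modal comparison: two methods land in the same bucket only when they agree on paradigm, modalities, and application.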

If this is right

  • Researchers gain a shared reference for comparing MCoT techniques across different modalities and tasks.
  • Identified challenges can focus development on consistent performance in noisy real-world settings such as autonomous driving.
  • Future work can follow the outlined directions to integrate MCoT more effectively with multimodal large language models.
  • Applications in healthcare and robotics can adopt standardized reasoning steps that build on the surveyed successes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could become the basis for new cross-modal benchmarks that measure step-by-step reasoning quality.
  • Linking specific taxonomy branches to model architectures might reveal which designs best support reliable multimodal inference.
  • The survey's challenges section may prompt hybrid approaches that combine MCoT with external tools or memory mechanisms.

Load-bearing premise

The body of published MCoT work is sufficiently complete and mature to support a stable taxonomy without major omissions or soon-to-be-invalidated categories.

What would settle it

Discovery of several high-impact MCoT papers or methods published before this survey that fall outside the proposed taxonomy categories or were not included in the analysis.

read the original abstract

By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first systematic survey of multimodal chain-of-thought (MCoT) reasoning. It elucidates foundational concepts and definitions, provides a comprehensive taxonomy of methodologies from diverse perspectives across application scenarios involving various modalities (image, video, speech, etc.), analyzes current approaches in MLLMs, and discusses challenges and future directions toward multimodal AGI.

Significance. Should the literature coverage prove representative and the taxonomy stable, this survey would serve as a key reference point for organizing the growing body of work on MCoT reasoning, facilitating cross-pollination of ideas across modalities and applications such as robotics and autonomous driving.

major comments (2)
  1. [Abstract and §1] The central claim of presenting the 'first systematic survey' with a 'comprehensive taxonomy' is load-bearing on the selection process, yet the manuscript provides no explicit literature search protocol (keywords, databases, date cutoffs, or inclusion/exclusion criteria). This omission prevents verification that the collected works form a representative sample, directly undermining the stability of the taxonomy in a fast-moving field.
  2. [Taxonomy section] The taxonomy is presented as comprehensive across modalities (image, video, speech, audio, 3D, structured data), but without a documented derivation process or explicit mapping of how edge cases (e.g., hybrid modalities or recent arXiv-only works) were handled, it risks being incomplete or unstable shortly after publication.
minor comments (2)
  1. [Abstract] The abstract would benefit from stating the approximate number of papers surveyed and the time period covered to give readers an immediate sense of scope.
  2. [Analysis section] Consider adding a summary table in the analysis section listing key methodologies by modality with representative citations to improve readability and quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help strengthen the transparency and rigor of our survey. We address each major point below and will incorporate revisions to document our methodology more explicitly.

read point-by-point responses
  1. Referee: [Abstract and §1] The central claim of presenting the 'first systematic survey' with a 'comprehensive taxonomy' is load-bearing on the selection process, yet the manuscript provides no explicit literature search protocol (keywords, databases, date cutoffs, or inclusion/exclusion criteria). This omission prevents verification that the collected works form a representative sample, directly undermining the stability of the taxonomy in a fast-moving field.

    Authors: We agree that an explicit literature search protocol is necessary to substantiate the claim of a systematic survey and to allow verification of coverage in this rapidly evolving area. The original manuscript did not include a dedicated description of the search strategy. In the revised version, we will add a new subsection (likely in Section 1) that details the databases consulted (arXiv, Google Scholar, ACL Anthology, and major conference proceedings), search keywords (including 'multimodal chain-of-thought', 'MCoT', 'multimodal CoT reasoning', and modality-specific variants), date cutoff (literature up to February 2025), and inclusion/exclusion criteria (prioritizing works with novel reasoning paradigms while excluding purely application-focused papers without methodological contribution). This addition will directly support the representativeness of the collected works and the stability of the taxonomy. revision: yes

  2. Referee: [Taxonomy section] The taxonomy is presented as comprehensive across modalities (image, video, speech, audio, 3D, structured data), but without a documented derivation process or explicit mapping of how edge cases (e.g., hybrid modalities or recent arXiv-only works) were handled, it risks being incomplete or unstable shortly after publication.

    Authors: We acknowledge the value of documenting the taxonomy derivation process. The taxonomy was constructed by iteratively grouping methodologies according to core dimensions: reasoning structure (e.g., step-wise vs. tree-based), modality fusion mechanisms, and application domains, informed by a broad review of the literature. To address the concern, the revised manuscript will expand the taxonomy section with an explicit paragraph describing this construction process, including criteria for classifying hybrid-modality works (assigning them to the dominant modality with cross-references) and the inclusion of recent arXiv preprints that met our novelty threshold. This will provide a clear rationale and mapping for edge cases, improving long-term stability. revision: yes

Circularity Check

0 steps flagged

No circularity: survey taxonomy compiled from external literature

full rationale

This is a literature survey paper with no mathematical derivations, equations, fitted parameters, or predictive claims that could reduce to self-defined inputs. The central contribution is a taxonomy and analysis drawn from cited external MCoT works; no step in the provided text defines a concept in terms of itself or renames a fitted result as a prediction. Self-citations, if present, are not load-bearing for the taxonomy construction, which rests on independent prior publications rather than a closed loop. The taxonomy is therefore grounded entirely in the compilation of outside sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The survey introduces no new free parameters, axioms, or invented entities; it reviews concepts already present in the multimodal AI literature.

pith-pipeline@v0.9.0 · 5504 in / 1068 out tokens · 34971 ms · 2026-05-15T17:14:02.519364+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  2. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  3. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  4. Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...

  5. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  6. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  7. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  8. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  9. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  10. V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

  11. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  12. Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.

  13. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

  14. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  15. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  16. C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

    cs.CL 2026-03 unverdicted novelty 6.0

    C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization ...

  17. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  18. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  19. Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

    cs.AI 2026-04 unverdicted novelty 5.0

    A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.

Reference graph

Works this paper leans on

275 extracted references · 275 canonical work pages · cited by 19 Pith papers · 42 internal anchors

  1. [1]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

  2. [2]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yan...

  3. [4]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong...

  4. [5]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing...

  5. [6]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, ...

  6. [7]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  7. [8]

    Llava-uhd: An LMM perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: An LMM perceiving any aspect ratio and high-resolution images. In ECCV, pages 390–406, 2024

  8. [10]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In NeurIPS, 2023

  9. [11]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. CoRR, abs/2409.12191, 2024

  10. [12]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In CVPR, pages 26753–26763, 2024

  11. [13]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  12. [14]

    NExT-GPT: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal llm. In International Conference on Machine Learning , pages 53366–53397, 2024

  13. [15]

    Vitron: A unified pixel-level vision LLM for understanding, generating, segmenting, editing

    Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision LLM for understanding, generating, segmenting, editing. In Advances in neural information processing systems, 2024

  14. [16]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024

  15. [17]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  16. [18]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023

  17. [19]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  18. [20]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  19. [21]

    Chain-of-thought reasoning without prompting

    Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting. In NeurIPS, 2024

  20. [22]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023

  21. [23]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, pages 17682–17690, 2024

  22. [24]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023

  23. [28]

    M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In ACL, pages 8199–8221. Association for Computational Linguistics, 2024

  24. [29]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  25. [30]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025

  26. [31]

    Video-of-thought: Step-by-step video reasoning from perception to cognition

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Forty-first International Conference on Machine Learning, 2024

  27. [32]

    Avqa-cot: When cot meets question answering in audio-visual scenarios

    Guangyao Li, Henghui Du, and Di Hu. Avqa-cot: When cot meets question answering in audio-visual scenarios. In CVPR Workshops, 2024

  28. [33]

    Cot3dref: Chain-of-thoughts data-efficient 3d visual grounding

    Eslam Abdelrahman, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, and Mohamed Elhoseiny. Cot3dref: Chain-of-thoughts data-efficient 3d visual grounding. arXiv preprint arXiv:2310.06214, 2023

  29. [34]

    Can we generate images with cot? let's verify and reinforce image generation step by step

    Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Can we generate images with cot? let's verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926, 2025

  30. [35]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. CoRR, abs/2402.12289, 2024

  31. [36]

    Dilu: A knowledge-driven approach to autonomous driving with large language models

    Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. In ICLR, 2024

  32. [37]

    Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and José M. Álvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. CoRR, abs/2405.01533, 2024

  33. [38]

    Is a 3d-tokenized LLM the key to reliable autonomous driving?

    Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, and Xiangyu Zhang. Is a 3d-tokenized LLM the key to reliable autonomous driving? CoRR, abs/2405.18361, 2024

  34. [39]

    Embodiedgpt: Vision-language pre-training via embodied chain of thought

    Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. Advances in Neural Information Processing Systems , 36:25081–25094, 2023

  35. [40]

    Embodied AI with large language models: A survey and new HRI framework

    Ming-Yi Lin, Ou-Wen Lee, and Chih-Ying Lu. Embodied AI with large language models: A survey and new HRI framework. In ICARM, pages 978–983, 2024

  36. [41]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  37. [42]

    Progprompt: Generating situated robot task plans using large language models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. In ICRA, pages 11523–11530, 2023

  38. [43]

    Robot learning in the era of foundation models: A survey

    Xuan Xiao, Jiahang Liu, Zhipeng Wang, Yanmin Zhou, Yong Qi, Qian Cheng, Bin He, and Shuo Jiang. Robot learning in the era of foundation models: A survey. CoRR, abs/2311.14379, 2023

  39. [44]

    Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning

    Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian D. Reid, and Niko Sünderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In CoRL, pages 23–72, 2023

  40. [45]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong T. Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Mich...

  41. [46]

    Foundation model for advancing healthcare: Challenges, opportunities, and future directions

    Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, and Hao Chen. Foundation model for advancing healthcare: Challenges, opportunities, and future directions. CoRR, abs/2404.03264, 2024

  42. [47]

    Healai: A healthcare LLM for effective medical documentation

    Sagar Goyal, Eti Rastogi, Sree Prasanna Rajagopal, Dong Yuan, Fen Zhao, Jai Chintagunta, Gautam Naik, and Jeff Ward. Healai: A healthcare LLM for effective medical documentation. In WSDM, pages 1167–1168, 2024

  43. [48]

    Healthcare copilot: Eliciting the power of general llms for medical consultation

    Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, and Dacheng Tao. Healthcare copilot: Eliciting the power of general llms for medical consultation. CoRR, abs/2402.13408, 2024

  44. [49]

    Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries

    Yiqiao Jin, Mohit Chandra, Gaurav Verma, Yibo Hu, Munmun De Choudhury, and Srijan Kumar. Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries. In ACM WWW, pages 2627–2638, 2024

  45. [50]

    Healthq: Unveiling questioning capabilities of LLM chains in healthcare conversations

    Ziyu Wang, Hao Li, Di Huang, and Amir M. Rahmani. Healthq: Unveiling questioning capabilities of LLM chains in healthcare conversations. CoRR, abs/2409.19487, 2024

  46. [51]

    Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, and Chuang Gan. See, think, confirm: Interactive prompting between vision and language models for knowledge-based visual reasoning. arXiv preprint arXiv:2301.05226, 2023

  47. [52]

    Juncheng Yang, Zuchao Li, Shuai Xie, Wei Yu, Shijun Li, and Bo Du. Soft-prompting with graph-of-thought for multi-modal representation learning. arXiv preprint arXiv:2404.04538, 2024

  48. [53]

    Junyi Yao, Yijiang Liu, Zhen Dong, Mingfei Guo, Helan Hu, Kurt Keutzer, Li Du, Daquan Zhou, and Shanghang Zhang. Promptcot: Align prompt distribution via adapted chain-of-thought. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7027–7037, 2024

  49. [54]

    Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, and William Yang Wang. Visual chain of thought: bridging logical gaps with multimodal infillings. arXiv preprint arXiv:2305.02317, 2023

  50. [55]

    Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. T-sciq: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19162–19170, 2024

  51. [56]

    Cheng Tan, Jingxuan Wei, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Ruifeng Guo, Bihui Yu, and Stan Z Li. Boosting the power of small multimodal reasoning models to match larger models with self-consistency training. In European Conference on Computer Vision, pages 305–322. Springer, 2024

  52. [57]

    Fanglong Yao, Changyuan Tian, Jintao Liu, Zequn Zhang, Qing Liu, Li Jin, Shuchao Li, Xiaoyu Li, and Xian Sun. Thinking like an expert: Multimodal hypergraph-of-thought (hot) reasoning to boost foundation modals. arXiv preprint arXiv:2308.06207, 2023

  53. [58]

    Jiajin Tang, Ge Zheng, Jingyi Yu, and Sibei Yang. Cotdet: Affordance knowledge prompting for task driven object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3068–3078, 2023

  54. [59]

    Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023

  55. [60]

    Lei Li. Cpseg: Finer-grained image semantic segmentation via chain-of-thought language prompting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 513–522, 2024

  56. [61]

    Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6672–6679. IEEE, 2024

  57. [62]

    Fanxu Meng, Haotong Yang, Yiding Wang, and Muhan Zhang. Chain of images for intuitively reasoning. arXiv preprint arXiv:2311.09241, 2023

  58. [63]

    Lai Wei, Wenkai Wang, Xiaoyu Shen, Yu Xie, Zhihao Fan, Xiaojin Zhang, Zhongyu Wei, and Wei Chen. Mc-cot: A modular collaborative cot framework for zero-shot medical-vqa with LLM and MLLM integration. CoRR, abs/2410.04521, 2024

  59. [64]

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024

  60. [65]

    Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let’s think outside the box: Exploring leap-of-thought in large language models with creative humor generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13246–13257, 2024

  61. [66]

    Liqi He, Zuchao Li, Xiantao Cai, and Ping Wang. Multi-modal latent space learning for chain-of-thought reasoning in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18180–18187, 2024

  62. [67]

    Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, and Jiebo Luo. Cocot: Contrastive chain-of-thought prompting for large multimodal models with multiple image inputs. arXiv preprint arXiv:2401.02582, 2024

  63. [68]

    Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18798–18806, 2024

  64. [69]

    Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junnyong Loo. Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving. arXiv preprint arXiv:2412.02025, 2024

  65. [70]

    Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024

  66. [71]

    Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359, 2024

  67. [72]

    Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, and Jian Wu. Dettoolchain: A new prompting paradigm to unleash detection ability of mllm. In European Conference on Computer Vision, pages 164–182. Springer, 2024

  68. [73]

    Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiao-Yong Wei, Tat-Seng Chua, and Qing Li. A picture is worth a graph: A blueprint debate paradigm for multimodal reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 419–428, 2024

  69. [74]

    Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, and Houqiang Li. Textcot: Zoom in for enhanced multimodal text-rich image understanding. arXiv preprint arXiv:2404.09797, 2024

  70. [75]

    M Abdul Khaliq, P Chang, M Ma, Bernhard Pflugfelder, and F Miletić. Ragar, your falsehood radar: Rag-augmented reasoning for political fact-checking using multimodal large language models. arXiv preprint arXiv:2404.12065, 2024

  71. [76]

    Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, et al. Cantor: Inspiring multimodal chain-of-thought of mllm. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9096–9105, 2024

  72. [77]

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

  73. [78]

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

  74. [79]

    Qun Li, Haixin Sun, Fu Xiao, Yiming Wang, Xinping Gao, and Bir Bhanu. Ps-cot-adapter: adapting plan-and-solve chain-of-thought for scienceqa. Science China Information Sciences, 68(1):119101, 2025

  75. [80]

    Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024

  76. [81]

    Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, and Wei Wang. Enhancing large vision language models with self-training on image comprehension. arXiv preprint arXiv:2405.19716, 2024

  77. [82]

    Guangmin Zheng, Jin Wang, Xiaobing Zhou, and Xuejie Zhang. Enhancing semantics in multimodal chain of thought via soft negative sampling. arXiv preprint arXiv:2405.09848, 2024

  78. [83]

    Haohao Luo, Yang Deng, Ying Shen, See-Kiong Ng, and Tat-Seng Chua. Chain-of-exemplar: enhancing distractor generation for multimodal educational question generation. In ACL, 2024

  79. [84]

    Zixi Jia, Jiqiang Liu, Hexiao Li, Qinghua Liu, and Hongbin Gao. Dcot: Dual chain-of-thought prompting for large multimodal models. In The 16th Asian Conference on Machine Learning (Conference Track), 2024

  80. [85]

    Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-Seng Chua. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 643–654, 2023
