Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Pith reviewed 2026-05-15 17:14 UTC · model grok-4.3
The pith
Multimodal chain-of-thought reasoning receives its first systematic survey and taxonomy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By extending chain-of-thought reasoning to multimodal contexts, MCoT has produced methods that integrate image, video, speech, audio, 3D, and structured data with large language models and deliver results in real-world applications. This work presents the first systematic survey of the field: it clarifies foundational concepts and definitions, offers a comprehensive taxonomy of methodologies viewed from multiple perspectives, analyzes them across application scenarios, and distills targeted insights on remaining challenges and research paths toward multimodal AGI.
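As a concrete anchor for what MCoT means operationally, here is a minimal sketch of the rationale-then-answer pattern common to early MCoT systems the survey covers; `call_mllm` is a hypothetical wrapper, not any specific model's API, and the prompts are illustrative.

```python
# Minimal sketch of the rationale-then-answer pattern at the heart of MCoT.
# `call_mllm` is a hypothetical stand-in for any multimodal LLM client.

def call_mllm(image: bytes, prompt: str) -> str:
    """Hypothetical wrapper; swap in a real multimodal LLM API."""
    raise NotImplementedError

def mcot_answer(image: bytes, question: str) -> tuple[str, str]:
    # Stage 1: elicit an explicit, step-by-step rationale grounded in the image.
    rationale = call_mllm(
        image,
        f"Question: {question}\n"
        "List the relevant visual evidence and reason step by step. "
        "Do not state the final answer yet.",
    )
    # Stage 2: infer the answer conditioned on the generated rationale.
    answer = call_mllm(
        image,
        f"Question: {question}\nRationale: {rationale}\n"
        "Given this rationale, state the final answer only.",
    )
    return rationale, answer
```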
What carries the argument
A comprehensive taxonomy that organizes MCoT methodologies according to reasoning paradigms, modality combinations, and application scenarios.
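A minimal sketch of how those three axes could be encoded as data, assuming illustrative category names (the enum values and example rows are not the survey's exact labels):

```python
# Illustrative encoding of the survey's three taxonomy axes; category names
# and example rows are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum

class Paradigm(Enum):
    PROMPT_BASED = "prompt-based"        # e.g., zero-/few-shot CoT prompting
    STRUCTURE_BASED = "structure-based"  # e.g., tree- or graph-shaped reasoning
    LEARNING_BASED = "learning-based"    # e.g., fine-tuned rationale generators

MODALITIES = {"image", "video", "speech", "audio", "3d", "structured"}

@dataclass(frozen=True)
class McotEntry:
    method: str
    paradigm: Paradigm
    modalities: frozenset  # subset of MODALITIES
    applications: tuple    # e.g., ("robotics", "healthcare")

corpus = [
    McotEntry("Multimodal-CoT", Paradigm.LEARNING_BASED, frozenset({"image"}), ("science QA",)),
    McotEntry("Video-of-Thought", Paradigm.STRUCTURE_BASED, frozenset({"video"}), ("video QA",)),
]

# Each axis then supports a one-line grouping query, which is what makes a
# shared taxonomy usable for comparing methods:
by_modality = {m: [e.method for e in corpus if m in e.modalities] for m in MODALITIES}
```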
If this is right
- Researchers gain a shared reference for comparing MCoT techniques across different modalities and tasks.
- The identified challenges can focus development on achieving consistent performance in noisy real-world settings such as autonomous driving.
- Future work can follow the outlined directions to integrate MCoT more effectively with multimodal large language models.
- Applications in healthcare and robotics can adopt standardized reasoning steps that build on the surveyed successes.
Where Pith is reading between the lines
- The taxonomy could become the basis for new cross-modal benchmarks that measure step-by-step reasoning quality (a metric sketch follows this list).
- Linking specific taxonomy branches to model architectures might reveal which designs best support reliable multimodal inference.
- The survey's challenges section may prompt hybrid approaches that combine MCoT with external tools or memory mechanisms.
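On the first point, a hedged sketch of what a step-level quality metric for such a benchmark could look like; `judge` is a hypothetical scorer (an LLM judge or an entailment model), and only the aggregation logic is shown:

```python
# Hedged sketch of a step-level reasoning-quality metric such a benchmark
# could adopt. `judge` is a hypothetical stand-in for an LLM judge or an
# entailment model.

def judge(step: str, evidence: str) -> float:
    """Hypothetical: grounding score in [0, 1] for one reasoning step."""
    raise NotImplementedError

def chain_quality(steps: list[str], evidence: str) -> float:
    """Average step-level grounding. A min() aggregation is an equally
    defensible variant, since one ungrounded step can break a chain."""
    if not steps:
        return 0.0
    scores = [judge(step, evidence) for step in steps]
    return sum(scores) / len(scores)
```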
Load-bearing premise
The body of published MCoT work is sufficiently complete and mature to support a stable taxonomy without major omissions or soon-to-be-invalidated categories.
What would settle it
Discovery of several high-impact MCoT papers or methods published before this survey that fall outside the proposed taxonomy categories or were not included in the analysis.
Original abstract
By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first systematic survey of multimodal chain-of-thought (MCoT) reasoning. It elucidates foundational concepts and definitions, provides a comprehensive taxonomy of methodologies from diverse perspectives across application scenarios involving various modalities (image, video, speech, etc.), analyzes current approaches in MLLMs, and discusses challenges and future directions toward multimodal AGI.
Significance. Should the literature coverage prove representative and the taxonomy stable, this survey would serve as a key reference point for organizing the growing body of work on MCoT reasoning, facilitating cross-pollination of ideas across modalities and applications such as robotics and autonomous driving.
Major comments (2)
- [Abstract and §1] The central claim of presenting the 'first systematic survey' with a 'comprehensive taxonomy' is load-bearing on the selection process, yet the manuscript provides no explicit literature search protocol (keywords, databases, date cutoffs, or inclusion/exclusion criteria). This omission prevents verification that the collected works form a representative sample, directly undermining the stability of the taxonomy in a fast-moving field.
- [Taxonomy section] The taxonomy is presented as comprehensive across modalities (image, video, speech, audio, 3D, structured data), but without a documented derivation process or explicit mapping of how edge cases (e.g., hybrid modalities or recent arXiv-only works) were handled, it risks being incomplete or unstable shortly after publication.
Minor comments (2)
- [Abstract] The abstract would benefit from stating the approximate number of papers surveyed and the time period covered to give readers an immediate sense of scope.
- [Analysis section] Consider adding a summary table in the analysis section listing key methodologies by modality with representative citations to improve readability and quick reference.
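As an illustration of the suggested table, one possible layout pairs each modality with a representative method the survey covers (the methods named are drawn from the surveyed works, but the selection and pairings here are illustrative):

Modality      | Representative surveyed method
Image         | DDCoT (duty-distinct CoT prompting)
Video         | Video-of-Thought (perception-to-cognition video reasoning)
Audio-visual  | AVQA-CoT (CoT for audio-visual question answering)
3D            | CoT3DRef (chain-of-thought 3D visual grounding)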
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which help strengthen the transparency and rigor of our survey. We address each major point below and will incorporate revisions to document our methodology more explicitly.
Point-by-point responses
Referee: [Abstract and §1] The central claim of presenting the 'first systematic survey' with a 'comprehensive taxonomy' is load-bearing on the selection process, yet the manuscript provides no explicit literature search protocol (keywords, databases, date cutoffs, or inclusion/exclusion criteria). This omission prevents verification that the collected works form a representative sample, directly undermining the stability of the taxonomy in a fast-moving field.
Authors: We agree that an explicit literature search protocol is necessary to substantiate the claim of a systematic survey and to allow verification of coverage in this rapidly evolving area. The original manuscript did not include a dedicated description of the search strategy. In the revised version, we will add a new subsection (likely in Section 1) that details the databases consulted (arXiv, Google Scholar, ACL Anthology, and major conference proceedings), search keywords (including 'multimodal chain-of-thought', 'MCoT', 'multimodal CoT reasoning', and modality-specific variants), date cutoff (literature up to February 2025), and inclusion/exclusion criteria (prioritizing works with novel reasoning paradigms while excluding purely application-focused papers without methodological contribution). This addition will directly support the representativeness of the collected works and the stability of the taxonomy. Revision: yes.
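For concreteness, a minimal sketch of the reproducible arXiv leg of such a protocol, assuming the open-source `arxiv` Python client; the query string and result cap are illustrative, the other databases would need their own queries, and screening against the stated inclusion/exclusion criteria remains a manual step:

```python
# Minimal sketch of a reproducible arXiv search for the stated protocol,
# using the open-source `arxiv` client (pip install arxiv). Query and
# result cap are illustrative assumptions.
from datetime import datetime, timezone
import arxiv

CUTOFF = datetime(2025, 3, 1, tzinfo=timezone.utc)  # "literature up to February 2025"
QUERY = 'all:"multimodal chain-of-thought" OR all:"multimodal CoT" OR all:MCoT'

client = arxiv.Client()
search = arxiv.Search(query=QUERY, max_results=500,
                      sort_by=arxiv.SortCriterion.SubmittedDate)

# Keep only candidates published before the cutoff; titles then go to screening.
candidates = [r for r in client.results(search) if r.published < CUTOFF]
for r in candidates:
    print(r.published.date(), r.title)
```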
Referee: [Taxonomy section] The taxonomy is presented as comprehensive across modalities (image, video, speech, audio, 3D, structured data), but without a documented derivation process or explicit mapping of how edge cases (e.g., hybrid modalities or recent arXiv-only works) were handled, it risks being incomplete or unstable shortly after publication.
Authors: We acknowledge the value of documenting the taxonomy derivation process. The taxonomy was constructed by iteratively grouping methodologies according to core dimensions: reasoning structure (e.g., step-wise vs. tree-based), modality fusion mechanisms, and application domains, informed by a broad review of the literature. To address the concern, the revised manuscript will expand the taxonomy section with an explicit paragraph describing this construction process, including criteria for classifying hybrid-modality works (assigning them to the dominant modality with cross-references) and the inclusion of recent arXiv preprints that met our novelty threshold. This will provide a clear rationale and mapping for edge cases, improving long-term stability. Revision: yes.
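A sketch of the hybrid-modality rule described above, filing each work under a dominant modality and keeping the rest as cross-references; treating the first listed modality as dominant is an assumption made for illustration:

```python
# Sketch of the stated hybrid-modality rule: dominant modality as primary
# category, remaining modalities as cross-references. First-listed-is-dominant
# is an illustrative assumption, not the authors' documented criterion.

def classify(method: str, modalities: list[str]) -> dict:
    primary, *secondary = modalities
    return {"method": method, "primary": primary, "cross_refs": secondary}

# e.g., an audio-visual QA work filed under audio, cross-referenced under video:
print(classify("AVQA-CoT", ["audio", "video"]))
# -> {'method': 'AVQA-CoT', 'primary': 'audio', 'cross_refs': ['video']}
```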
Circularity Check
No circularity: survey taxonomy compiled from external literature
Full rationale
This is a literature survey paper with no mathematical derivations, equations, fitted parameters, or predictive claims that could reduce to self-defined inputs. The central contribution is a taxonomy and analysis drawn from cited external MCoT works; no step in the provided text defines a concept in terms of itself or renames a fitted result as a prediction. Self-citations, if present, are not load-bearing for the taxonomy construction, which rests on independent prior publications rather than a closed loop. The chain of support therefore runs through compilation of outside sources rather than through any self-referential loop.
Forward citations
Cited by 19 Pith papers
- ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety. ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
- SCP: Spatial Causal Prediction in Video. SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning. RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
- Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs. Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
- See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection. ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
- Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings. Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering. HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model. OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization. V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models. Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models. VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
- CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning. CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
- From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning. EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models. Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis. C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization ...
- From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs. SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
- Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning. CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
- Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games. A multi-agent system creates role-specific murder mystery scripts and applies chain-of-thought fine-tuning plus GRPO reinforcement learning to improve VLMs' multi-hop reasoning under uncertainty and deception.