pith. machine review for the scientific record.

arxiv: 2512.14044 · v3 · submitted 2025-12-16 · 💻 cs.CV · cs.AI

Recognition: no theorem link

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autonomous driving · vision-language models · chain-of-thought reasoning · reinforcement learning · visual grounding · multi-modal CoT · trustworthy AI

The pith

Reinforcement learning lets vision-language models interleave visual attention with reasoning steps to reduce hallucinations in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address object hallucination in vision-language models for autonomous driving, which arises because standard chain-of-thought reasoning stays purely textual and ungrounded in images. It proposes OmniDrive-R1 as an end-to-end framework that uses an interleaved multi-modal chain-of-thought process driven by reinforcement learning, allowing the model to direct its own visual focus during reasoning steps. The Clip-GRPO algorithm supplies a process-based reward that enforces consistency between visual regions and text without requiring dense localization labels or external tools. A sympathetic reader would care because this removes two major barriers—decoupled perception-reasoning stages and expensive annotations—while delivering large measured gains on a driving benchmark.

Core claim

OmniDrive-R1 introduces an end-to-end VLM that unifies perception and reasoning through an interleaved multi-modal chain-of-thought mechanism. Its reinforcement-driven visual grounding lets the model autonomously zoom in on critical image regions. This is realized via a pure two-stage RL pipeline and the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward that enforces real-time cross-modal consistency between visual focus and textual reasoning.

What carries the argument

The interleaved multi-modal chain-of-thought (iMCoT) mechanism combined with the Clip-GRPO algorithm, which supplies a process-based grounding reward that ties visual attention directly to reasoning steps without dense labels.
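The shape of such a process-based reward can be made concrete. This is a minimal sketch assuming generic image-region and text embeddings as plain vectors; the actual Clip-GRPO reward formula and its threshold are not given in the abstract, so `floor` is an illustrative constant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def grounding_reward(region_emb, step_text_emb, floor=0.2):
    """Process-based reward: cross-modal consistency between the
    attended image region and the current reasoning step, rescaled so
    that similarity at or below `floor` earns nothing. `floor` is an
    assumption for illustration, not a value from the paper."""
    sim = cosine(region_emb, step_text_emb)
    return max(0.0, (sim - floor) / (1.0 - floor))
```

In the full system the two embeddings would come from CLIP's image and text encoders; plain vectors are used here so the reward shape is visible without model weights.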

If this is right

  • Perception and reasoning stages can be jointly optimized in a single end-to-end training loop.
  • Visual grounding becomes possible without collecting dense localization annotations for every training example.
  • Real-time cross-modal consistency can be enforced during inference without calling external tools.
  • Reasoning quality and final answer accuracy both rise substantially on driving-specific benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement grounding approach could be tested on other safety-critical domains such as medical image interpretation where hallucinations are costly.
  • If the attention maps remain stable under distribution shift, the method may support deployment in vehicles where scenes change rapidly.
  • Process-based rewards of this form might replace outcome-only supervision in broader multimodal reasoning tasks beyond driving.

Load-bearing premise

The measured improvements stem from the new interleaved grounding reward rather than from unstated differences in training data volume, compute budget, or baseline implementation details.

What would settle it

Run the trained model on a fresh driving dataset containing scenes with known hallucinated objects and check whether the model's generated attention regions during reasoning actually cover the objects referenced in its final answer.
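The proposed check reduces to box arithmetic. A sketch, where `attended_boxes` are the regions the model zoomed into during reasoning and `answer_object_boxes` are hypothetical ground-truth boxes for the objects named in its final answer (both names are assumptions for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def answer_is_grounded(attended_boxes, answer_object_boxes, thresh=0.5):
    """True iff every object referenced in the final answer overlaps
    at least one attended region above the IoU threshold."""
    return all(any(iou(att, obj) >= thresh for att in attended_boxes)
               for obj in answer_object_boxes)
```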

Figures

Figures reproduced from arXiv: 2512.14044 by Bo Zhang, Haohan Zheng, Le Xu, Qu Chen, Tianchen Deng, Wuxiong Huang, Xuefeng Chen, Yishen Wang, Zhenguo Zhang.

Figure 1: An illustration of OmniDrive-R1's interleaved multi-modal chain-of-thought reasoning. The model initiates a multi-step thought process (Round 1) by actively invoking the Zoom-in Tool to ground its reasoning on a critical region (the traffic signal). This mechanism dynamically acquires fine-grained visual evidence (Round 2), which is directly used to refine the thought and arrive at a confident, vis…
Figure 2: The overall iMCoT reasoning framework of OmniDrive-R1. The model operates in an iterative loop: starting from the Original Image (I0) and a Question, the VLM generates a textual thought. It then autonomously decides whether to invoke the Zoom-in tool to actively zoom into a crucial visual region, dynamically acquiring new, fine-grained visual evidence (Cropped Image 1 (I1)) based on its native grounding ca…
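The iterative loop the caption describes can be sketched as plain control flow. Here `step_fn` stands in for the VLM's generate-and-decide step and the tuple it returns is an assumed interface, not the paper's actual API:

```python
def imcot_answer(step_fn, image, question, max_rounds=4):
    """Interleaved multi-modal chain-of-thought loop (sketch).
    step_fn(images, question, thoughts) -> (thought, zoom_box, answer):
    the model emits a thought, then either a box to zoom into (answer
    is None) or a final answer (zoom_box is None)."""
    def crop(img, box):
        # image as a nested list of rows; box as (x1, y1, x2, y2)
        x1, y1, x2, y2 = box
        return [row[x1:x2] for row in img[y1:y2]]

    images, thoughts = [image], []
    for _ in range(max_rounds):
        thought, zoom_box, answer = step_fn(images, question, thoughts)
        thoughts.append(thought)
        if answer is not None:
            return answer, thoughts
        if zoom_box is not None:
            # fine-grained evidence feeds back into the next round
            images.append(crop(image, zoom_box))
    return None, thoughts
```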
Figure 3: The two-stage reinforcement learning pipeline for OmniDrive-R1. The training process effectively decouples tool learning from task optimization. Stage 1 (Tool Learning, Left) utilizes the novel Clip-GRPO algorithm on Ddetail to enforce robust grounding: the Process Reward (ROI Grounding Reward), which is annotation-free, uses CLIP's cross-modal consistency to ensure the localized region is semantically rel…
Figure 4: Automated pipeline for generating RL-verifiable data (Ddrive rl). To enhance reward verification accuracy and scalability for RL training, open-ended scene Q&A Dscene from Ddrive is converted into structured, easily verifiable formats (multiple-choice or true/false). The process leverages an advanced MLLM (Qwen2.5VL-72B) for Diversity Sampling, followed by a Rule-based Scoring system (assessing format and …
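Rule-based scoring of the kind the caption describes typically reduces to a format check plus an exact-match correctness check. A sketch under assumed conventions: the `<answer>` tag and the 0 / 0.1 / 1.0 reward split are illustrative, not the paper's rubric.

```python
import re

def mcq_reward(output, gold_choice):
    """Verifiable reward for a multiple-choice or true/false item
    (sketch). Malformed output earns 0, a well-formed wrong answer a
    small format credit, and a correct answer full reward."""
    m = re.search(r"<answer>\s*([A-D]|True|False)\s*</answer>", output)
    if m is None:
        return 0.0   # format failure: no parseable answer
    return 1.0 if m.group(1) == gold_choice else 0.1
```

Converting open-ended answers into this form is what makes the reward cheap to verify at RL scale.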
Original abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus, we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces OmniDrive-R1, an end-to-end vision-language model for autonomous driving that unifies perception and reasoning via an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Its central innovation is a two-stage reinforcement learning pipeline employing the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward enforcing real-time cross-modal consistency to reduce object hallucination. On the DriveLMM-o1 benchmark the paper reports that OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and final-answer accuracy from 37.81% to 73.62% relative to the Qwen2.5VL-7B baseline.

Significance. If the performance deltas can be isolated to the iMCoT and Clip-GRPO components, the work would constitute a meaningful step toward trustworthy multi-modal reasoning in safety-critical settings. The annotation-free grounding reward and end-to-end optimization address two documented shortcomings of prior multi-modal CoT methods; successful validation would therefore be of clear interest to the autonomous-driving and reliable-VLM communities.

major comments (3)
  1. [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.
  2. [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.
  3. [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.
minor comments (1)
  1. [Methods] The acronym Clip-GRPO is introduced without expansion or explicit relation to standard GRPO/PPO variants; a brief derivation or pseudocode in the methods section would improve clarity.
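For context on that last point: the group-relative advantage at the core of standard GRPO (introduced in DeepSeekMath) can be sketched as below. How Clip-GRPO folds the grounding reward into this computation is not specified in the abstract.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimation (sketch): sample a group of
    completions per prompt, then score each completion by its reward
    minus the group mean, divided by the group std. No learned value
    network is needed, which is what makes the pipeline lightweight."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0.0:
        return [0.0] * len(rewards)   # degenerate group: no learning signal
    return [(r - mu) / sd for r in rewards]
```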

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.

    Authors: We agree that specifying the training configuration for the baseline is essential to isolate the effect of our proposed components. In the revised manuscript, we have added explicit statements in both the abstract and the experiments section clarifying that the Qwen2.5VL-7B baseline was trained with the same data volume, number of RL steps, and optimizer schedule. The total compute was matched as closely as possible given hardware constraints, and we note that the primary difference lies in the application of the Clip-GRPO algorithm versus standard RL. This supports our claim that the gains stem from the annotation-free grounding reward. revision: yes

  2. Referee: [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.

    Authors: While the grounding reward is computed internally based on cross-modal consistency, the evaluation metrics on DriveLMM-o1 are derived from an independent benchmark with fixed ground-truth answers and reasoning annotations that were not used in reward computation. To further address concerns of circularity, we have included an additional ablation in the revised experiments section that removes the grounding reward while keeping the iMCoT structure, showing that performance drops significantly, thus demonstrating the reward's contribution beyond the internal definition. We also emphasize that the trustworthiness metrics include multi-hop reasoning accuracy not directly optimized by the consistency reward. revision: partial

  3. Referee: [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.

    Authors: We acknowledge the importance of these statistical controls and ablations. In the revised version, we report results averaged over three independent runs with different random seeds, include standard deviation, and perform paired t-tests to establish statistical significance of the improvements. Additionally, we have added a new ablation table that isolates the contribution of the interleaved CoT mechanism from the full two-stage RL pipeline with Clip-GRPO, confirming that both elements are necessary for the reported gains. revision: yes
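The statistical control the rebuttal describes is cheap to compute. A generic sketch of the paired t statistic over matched seeds (not the authors' analysis code; with three seeds there are only 2 degrees of freedom, so this is weak evidence at best):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic for per-seed scores of two systems:
    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-run score
    differences. Assumes the runs are not all identical (stdev > 0)."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
```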

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical training procedure (two-stage RL with Clip-GRPO) whose grounding reward is explicitly defined as enforcing cross-modal consistency; the reported gains are measured on the external DriveLMM-o1 benchmark via reasoning score and final-answer accuracy. No derivation chain reduces any claimed result to its inputs by construction, no self-citation is used as a load-bearing uniqueness theorem, and no fitted parameter is relabeled as an independent prediction. The method is therefore self-contained as a standard RL pipeline evaluated on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on standard VLM and RL assumptions plus newly introduced training components; no external formal verification or independent benchmarks are referenced in the abstract.

axioms (1)
  • domain assumption: End-to-end joint optimization of perception and reasoning is feasible via RL in VLMs.
    Invoked to justify the two-stage reinforcement learning pipeline.
invented entities (1)
  • Clip-GRPO algorithm (no independent evidence)
    purpose: annotation-free, process-based grounding reward enforcing cross-modal consistency.
    A newly introduced component whose stability is asserted but not externally validated in the abstract.

pith-pipeline@v0.9.0 · 5620 in / 1238 out tokens · 40581 ms · 2026-05-16T22:29:53.449480+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 9 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 29915–29926, 2025. 1

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 6, 7

  4. [4]

    Spatialbot: Pre- cise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9490–9498. IEEE, 2025. 7, 8

  5. [5]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 7

  6. [6]

    Spa- tialrgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37: 135062–135093, 2024

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spa- tialrgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37: 135062–135093, 2024. 8

  7. [7]

    Visual thoughts: A unified perspective of understanding multimodal chain-of-thought.arXiv preprint arXiv:2505.15510, 2025

    Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, et al. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought.arXiv preprint arXiv:2505.15510, 2025. 1

  8. [8]

    Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving

    Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K Singh, Sid- dharth Srivastava, Krishna Murthy Jatavallabhula, and K Mad- hava Krishna. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. In2024 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 16345–16352. IEEE, 2024. 3

  9. [9]

    Retrieval-based inter- leaved visual chain-of-thought in real-world driving scenarios

    Charles Corbi`ere, Simon Roburin, Syrielle Montariol, An- toine Bosselut, and Alexandre Alahi. Retrieval-based inter- leaved visual chain-of-thought in real-world driving scenarios. arXiv preprint arXiv:2501.04671, 2025. 3

  10. [10]

    Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904,

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904,

  11. [11]

    Drive like a human: Rethinking autonomous driving with large language models

    Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024. 3

  12. [12]

    Surds: Benchmarking spatial understand- ing and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

    Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understand- ing and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024. 6, 7, 1, 2

  13. [13]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wen- meng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 14281–14290,

  14. [14]

    Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 1

  15. [15]

    Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025

    Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025. 1, 2, 6, 7

  16. [16]

    Gpt-4o: The cutting-edge advance- ment in multimodal llm

    R Islam and OM Moushi. Gpt-4o: The cutting-edge advance- ment in multimodal llm. techrxiv, 2024. 2, 7

  17. [17]

    Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025

    Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025. 1

  18. [18]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 1, 3

  19. [19]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Represen- tations, 2023. 3

  20. [20]

    Ovis: Structural embed- ding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embed- ding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024. 7

  21. [21]

    Multi-task collaborative network for joint referring expression comprehension and segmentation

    Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task collaborative network for joint referring expression comprehension and segmentation. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 10034– 10043, 2020. 3

  22. [22]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024. 3

  23. [23]

    Reason2drive: Towards 9 interpretable and chain-based reasoning for autonomous driv- ing

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jian- hua Han, Hang Xu, and Li Zhang. Reason2drive: Towards 9 interpretable and chain-based reasoning for autonomous driv- ing. InEuropean Conference on Computer Vision, pages 292–308. Springer, 2024. 3

  24. [24]

    Pfau, J., Merrill, W., and Bowman, S

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting llms for efficient parallel generation.arXiv preprint arXiv:2307.15337, 2023. 3

  25. [25]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 3

  26. [26]

    Mutual reasoning makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,

    Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,

  27. [27]

    Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision- language models for autonomous driving.arXiv preprint arXiv:2505.15298, 2025

    Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision- language models for autonomous driving.arXiv preprint arXiv:2505.15298, 2025. 1, 2, 3, 4, 6, 7

  28. [28]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024. 3

  29. [29]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

  30. [30]

    A re- duction of imitation learning and structured prediction to no-regret online learning

    St´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth in- ternational conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 1

  31. [31]

    Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 1

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 5

  33. [33]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 3

  34. [34]

    Driving with graph visual question answering

    C Sima, K Renz, K Chitta, L Chen, H Zhang, C Xie, P Luo, A Geiger, and H Drivelm Li. Driving with graph visual question answering. arxiv 2023.arXiv preprint arXiv:2312.14150,

  35. [35]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 3

  36. [36]

    Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024

    Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024. 1

  37. [37]

    Mm- verify: Enhancing multimodal reasoning with chain-of- thought verification.arXiv preprint arXiv:2502.13383, 2025

    Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tian- peng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. Mm- verify: Enhancing multimodal reasoning with chain-of- thought verification.arXiv preprint arXiv:2502.13383, 2025. 3

  38. [38]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking mul- timodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 8

  39. [39]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms, 2025

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, and Salman Khan. Llamav-o1: Rethinking step-by-step visual reasoning in llms, 2025. 3

  40. [40]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 24290–24315, 2025. 7

  41. [41]

    Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025. 1

  42. [42]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in lan- guage models.arXiv preprint arXiv:2203.11171, 2022. 3

  43. [43]

    Chain-of- thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824– 24837, 2022. 1, 3

  44. [44]

    V?: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 3

  45. [45]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 2087–2098,

  46. [46]

    Llava-cot: Let vision language models reason step-by-step, 2025

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. 3 10

  47. [47]

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters.

  48. [48]

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.

  49. [49]

    Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, 36:17773–17794.

  50. [50]

    Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, and Jiaya Jia. Prompt highlighter: Interactive control for multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13215–13224, 2024.

  51. [51]

    Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.

  52. [52]

    Haohan Zheng and Zhenguo Zhang. Modality bias in LVLMs: Analyzing and mitigating object hallucination via attention lens. arXiv preprint arXiv:2508.02419, 2025.

  53. [53]

    Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. Seqtr: A simple yet universal network for visual grounding. In European Conference on Computer Vision, pages 598–615. Springer, 2022.

    Faithfulness-Step (1-10): Measures how well the model's reasoning steps align with the ground truth.
    • 9-10: All steps correctly match or closely reflect the reference.
    • 7-8: Most steps align, with minor deviations.
    • 5-6: Some steps align, but several are incorrect or missing.
    • 3-4: Few steps align; most are inaccurate or missing.
    • 1-2: Majority of steps are incorrect.

    Informativeness-Step (1-10): Measures completeness of reasoning.
    • 9-10: Captures almost all critical information.
    • 7-8: Covers most key points, with minor omissions.
    • 5-6: Missing significant details.
    • 3-4: Only partial reasoning present.
    • 1-2: Poor extraction of relevant reasoning.

    Risk Assessment Accuracy (1-10): Evaluates if the model correctly prioritizes high-risk objects or scenarios.
    • 9-10: Correctly identifies and prioritizes key dangers.
    • 7-8: Mostly accurate, with minor misprioritizations.
    • 5-6: Some important risks are overlooked.
    • 3-4: Significant misjudgments in risk prioritization.
    • 1-2: Misidentifies key risks or misses th…

    Traffic Rule Adherence (1-10): Evaluates whether the response follows traffic laws and driving best practices.
    • 9-10: Fully compliant with legal and safe driving practices.
    • 7-8: Minor deviations, but mostly correct.
    • 5-6: Some inaccuracies in legal/safe driving recommendations.
    • 3-4: Several rule violations or unsafe suggestions.
    • 1-2: Promotes highly un…

    Scene Awareness & Object Understanding (1-10): Measures how well the response interprets objects, their positions, and actions.
    • 9-10: Clearly understands all relevant objects and their relationships.
    • 7-8: Minor misinterpretations but mostly correct.
    • 5-6: Some key objects misunderstood or ignored.
    • 3-4: Many errors in object recognition and reasoning.
    • 1-2: …

    Repetition-Token (1-10): Identifies unnecessary repetition in reasoning.
    • 9-10: No redundancy, very concise.
    • 7-8: Minor repetition but still clear.
    • 5-6: Noticeable redundancy.
    • 3-4: Frequent repetition that disrupts reasoning.
    • 1-2: Excessive redundancy, making reasoning unclear.

    Hallucination (1-10): Detects irrelevant or invented reasoning steps not aligned with ground truth.
    • 9-10: No hallucinations, all reasoning is grounded.
    • 7-8: One or two minor hallucinations.
    • 5-6: Some fabricated details.
    • 3-4: Frequent hallucinations.
    • 1-2: Majority of reasoning is hallucinated.

    Semantic Coverage-Step (1-10): Checks if the response fully covers the critical reasoning elements.
    • 9-10: Nearly complete semantic coverage.
    • 7-8: Good coverage, some minor omissions.
    • 5-6: Partial coverage with key gaps.
    • 3-4: Major gaps in reasoning.
    • 1-2: Very poor semantic coverage.

    Commonsense Reasoning (1-10): Assesses the use of intuitive driving logic in reasoning.
    • 9-10: Displays strong commonsense understanding.
    • 7-8: Mostly correct, with minor gaps.
    • 5-6: Some commonsense errors.
    • 3-4: Frequent commonsense mistakes.
    • 1-2: Lacks basic driving commonsense.

    Missing Step (1-10): Evaluates if any necessary reasoning steps are missing.
    • 9-10: No critical steps missing.
    • 7-8: Minor missing steps, but answer is mostly intact.
    • 5-6: Some important steps missing.
    • 3-4: Many critical reasoning gaps.
    • 1-2: Response is highly incomplete.

    Relevance (1-10): Measures how well the response is specific to the given scenario and ground truth.
    • 9-10: Highly specific and directly relevant to the driving scenario. Captures critical elements precisely, with no unnecessary generalization.
    • 7-8: Mostly relevant, but some minor parts may be overly generic or slightly off-focus.
    • 5-6: Somewhat relevan…

    Missing Details (1-10): Evaluates the extent to which critical information is missing from the response, impacting the reasoning quality.
    • 9-10: No significant details are missing; response is comprehensive and complete.
    • 7-8: Covers most important details, with minor omissions that do not severely impact reasoning.
    • 5-6: Some essential details are mis…
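Taken together, the criteria above define a twelve-dimension 1-10 rubric that an LLM judge scores per response. A minimal sketch of how such per-criterion scores might be validated and pooled into one reasoning-quality number — the equal-weight mean and all identifier names here are assumptions for illustration, not an aggregation specified by the paper:

```python
# Hypothetical aggregation of the twelve rubric criteria listed above.
# Each criterion is assumed to arrive as an integer 1-10 from the judge;
# the equal-weight mean is an illustrative choice, not the paper's scheme.

CRITERIA = [
    "faithfulness_step",
    "informativeness_step",
    "risk_assessment_accuracy",
    "traffic_rule_adherence",
    "scene_awareness",
    "repetition_token",
    "hallucination",
    "semantic_coverage_step",
    "commonsense_reasoning",
    "missing_step",
    "relevance",
    "missing_details",
]

def aggregate(scores: dict) -> float:
    """Validate per-criterion scores (1-10) and return their mean."""
    for name in CRITERIA:
        s = scores[name]
        if not 1 <= s <= 10:
            raise ValueError(f"{name} score {s} outside 1-10 scale")
    return sum(scores[n] for n in CRITERIA) / len(CRITERIA)

# Example: a response scored 8 on every criterion.
example = {name: 8 for name in CRITERIA}
print(aggregate(example))  # -> 8.0
```

A weighted mean (e.g. up-weighting the hallucination and traffic-rule axes) would drop in the same place if safety-critical criteria should dominate the overall score.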