pith. machine review for the scientific record.

arxiv: 2512.14044 · v3 · submitted 2025-12-16 · 💻 cs.CV · cs.AI

Recognition: no theorem link

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autonomous driving · vision-language models · chain-of-thought reasoning · reinforcement learning · visual grounding · multi-modal CoT · trustworthy AI

The pith

Reinforcement learning lets vision-language models interleave visual attention with reasoning steps to reduce hallucinations in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address object hallucination in vision-language models for autonomous driving, which arises because standard chain-of-thought reasoning stays purely textual and ungrounded in images. It proposes OmniDrive-R1 as an end-to-end framework that uses an interleaved multi-modal chain-of-thought process driven by reinforcement learning, allowing the model to direct its own visual focus during reasoning steps. The Clip-GRPO algorithm supplies a process-based reward that enforces consistency between visual regions and text without requiring dense localization labels or external tools. A sympathetic reader would care because this removes two major barriers—decoupled perception-reasoning stages and expensive annotations—while delivering large measured gains on a driving benchmark.

Core claim

OmniDrive-R1 introduces an end-to-end VLM that unifies perception and reasoning through an interleaved multi-modal chain-of-thought mechanism. Its reinforcement-driven visual grounding lets the model autonomously zoom in on critical image regions. This is realized via a pure two-stage RL pipeline and the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward that enforces real-time cross-modal consistency between visual focus and textual reasoning.

What carries the argument

The interleaved multi-modal chain-of-thought (iMCoT) mechanism combined with the Clip-GRPO algorithm, which supplies a process-based grounding reward that ties visual attention directly to reasoning steps without dense labels.
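The shape of such a process-based reward can be made concrete. This is a minimal sketch assuming generic image-region and text embeddings as plain vectors; the actual Clip-GRPO reward formula and its threshold are not given in the abstract, so `floor` is an illustrative constant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def grounding_reward(region_emb, step_text_emb, floor=0.2):
    """Process-based reward: cross-modal consistency between the
    attended image region and the current reasoning step, rescaled so
    that similarity at or below `floor` earns nothing. `floor` is an
    assumption for illustration, not a value from the paper."""
    sim = cosine(region_emb, step_text_emb)
    return max(0.0, (sim - floor) / (1.0 - floor))
```

In the full system the two embeddings would come from CLIP's image and text encoders; plain vectors are used here so the reward shape is visible without model weights.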

If this is right

  • Perception and reasoning stages can be jointly optimized in a single end-to-end training loop.
  • Visual grounding becomes possible without collecting dense localization annotations for every training example.
  • Real-time cross-modal consistency can be enforced during inference without calling external tools.
  • Reasoning quality and final answer accuracy both rise substantially on driving-specific benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement grounding approach could be tested on other safety-critical domains such as medical image interpretation where hallucinations are costly.
  • If the attention maps remain stable under distribution shift, the method may support deployment in vehicles where scenes change rapidly.
  • Process-based rewards of this form might replace outcome-only supervision in broader multimodal reasoning tasks beyond driving.

Load-bearing premise

The measured improvements stem from the new interleaved grounding reward rather than from unstated differences in training data volume, compute budget, or baseline implementation details.

What would settle it

Run the trained model on a fresh driving dataset containing scenes with known hallucinated objects and check whether the model's generated attention regions during reasoning actually cover the objects referenced in its final answer.
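The proposed check reduces to box arithmetic. A sketch, where `attended_boxes` are the regions the model zoomed into during reasoning and `answer_object_boxes` are hypothetical ground-truth boxes for the objects named in its final answer (both names are assumptions for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def answer_is_grounded(attended_boxes, answer_object_boxes, thresh=0.5):
    """True iff every object referenced in the final answer overlaps
    at least one attended region above the IoU threshold."""
    return all(any(iou(att, obj) >= thresh for att in attended_boxes)
               for obj in answer_object_boxes)
```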

Figures

Figures reproduced from arXiv: 2512.14044 by Bo Zhang, Haohan Zheng, Le Xu, Qu Chen, Tianchen Deng, Wuxiong Huang, Xuefeng Chen, Yishen Wang, Zhenguo Zhang.

Figure 1: An illustration of OmniDrive-R1's interleaved multi-modal chain-of-thought reasoning. The model initiates a multi-step thought process (Round 1) by actively invoking the Zoom-in Tool to ground its reasoning on a critical region (the traffic signal). This mechanism dynamically acquires fine-grained visual evidence (Round 2), which is directly used to refine the thought and arrive at a confident, vis…
Figure 2: The overall iMCoT reasoning framework of OmniDrive-R1. The model operates in an iterative loop: starting from the Original Image (I0) and a Question, the VLM generates a textual thought. It then autonomously decides whether to invoke the Zoom-in tool to actively zoom into a crucial visual region, dynamically acquiring new, fine-grained visual evidence (Cropped Image 1 (I1)) based on its native grounding ca…
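The iterative loop the caption describes can be sketched as plain control flow. Here `step_fn` stands in for the VLM's generate-and-decide step and the tuple it returns is an assumed interface, not the paper's actual API:

```python
def imcot_answer(step_fn, image, question, max_rounds=4):
    """Interleaved multi-modal chain-of-thought loop (sketch).
    step_fn(images, question, thoughts) -> (thought, zoom_box, answer):
    the model emits a thought, then either a box to zoom into (answer
    is None) or a final answer (zoom_box is None)."""
    def crop(img, box):
        # image as a nested list of rows; box as (x1, y1, x2, y2)
        x1, y1, x2, y2 = box
        return [row[x1:x2] for row in img[y1:y2]]

    images, thoughts = [image], []
    for _ in range(max_rounds):
        thought, zoom_box, answer = step_fn(images, question, thoughts)
        thoughts.append(thought)
        if answer is not None:
            return answer, thoughts
        if zoom_box is not None:
            # fine-grained evidence feeds back into the next round
            images.append(crop(image, zoom_box))
    return None, thoughts
```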
Figure 3: The two-stage reinforcement learning pipeline for OmniDrive-R1. The training process effectively decouples tool learning from task optimization. Stage 1 (Tool Learning, Left) utilizes the novel Clip-GRPO algorithm on Ddetail to enforce robust grounding: the Process Reward (ROI Grounding Reward), which is annotation-free, uses CLIP's cross-modal consistency to ensure the localized region is semantically rel…
Figure 4: Automated pipeline for generating RL-verifiable data (Ddrive rl). To enhance reward verification accuracy and scalability for RL training, open-ended scene Q&A Dscene from Ddrive is converted into structured, easily verifiable formats (multiple-choice or true/false). The process leverages an advanced MLLM (Qwen2.5VL-72B) for Diversity Sampling, followed by a Rule-based Scoring system (assessing format and …
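Rule-based scoring of the kind the caption describes typically reduces to a format check plus an exact-match correctness check. A sketch under assumed conventions: the `<answer>` tag and the 0 / 0.1 / 1.0 reward split are illustrative, not the paper's rubric.

```python
import re

def mcq_reward(output, gold_choice):
    """Verifiable reward for a multiple-choice or true/false item
    (sketch). Malformed output earns 0, a well-formed wrong answer a
    small format credit, and a correct answer full reward."""
    m = re.search(r"<answer>\s*([A-D]|True|False)\s*</answer>", output)
    if m is None:
        return 0.0   # format failure: no parseable answer
    return 1.0 if m.group(1) == gold_choice else 0.1
```

Converting open-ended answers into this form is what makes the reward cheap to verify at RL scale.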
Original abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus, we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces OmniDrive-R1, an end-to-end vision-language model for autonomous driving that unifies perception and reasoning via an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Its central innovation is a two-stage reinforcement learning pipeline employing the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward enforcing real-time cross-modal consistency to reduce object hallucination. On the DriveLMM-o1 benchmark the paper reports that OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and final-answer accuracy from 37.81% to 73.62% relative to the Qwen2.5VL-7B baseline.

Significance. If the performance deltas can be isolated to the iMCoT and Clip-GRPO components, the work would constitute a meaningful step toward trustworthy multi-modal reasoning in safety-critical settings. The annotation-free grounding reward and end-to-end optimization address two documented shortcomings of prior multi-modal CoT methods; successful validation would therefore be of clear interest to the autonomous-driving and reliable-VLM communities.

major comments (3)
  1. [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.
  2. [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.
  3. [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.
minor comments (1)
  1. [Methods] The acronym Clip-GRPO is introduced without expansion or explicit relation to standard GRPO/PPO variants; a brief derivation or pseudocode in the methods section would improve clarity.
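For context on that last point: the group-relative advantage at the core of standard GRPO (introduced in DeepSeekMath) can be sketched as below. How Clip-GRPO folds the grounding reward into this computation is not specified in the abstract.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimation (sketch): sample a group of
    completions per prompt, then score each completion by its reward
    minus the group mean, divided by the group std. No learned value
    network is needed, which is what makes the pipeline lightweight."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0.0:
        return [0.0] * len(rewards)   # degenerate group: no learning signal
    return [(r - mu) / sd for r in rewards]
```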

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.

    Authors: We agree that specifying the training configuration for the baseline is essential to isolate the effect of our proposed components. In the revised manuscript, we have added explicit statements in both the abstract and the experiments section clarifying that the Qwen2.5VL-7B baseline was trained with the same data volume, number of RL steps, and optimizer schedule. The total compute was matched as closely as possible given hardware constraints, and we note that the primary difference lies in the application of the Clip-GRPO algorithm versus standard RL. This supports our claim that the gains stem from the annotation-free grounding reward. revision: yes

  2. Referee: [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.

    Authors: While the grounding reward is computed internally based on cross-modal consistency, the evaluation metrics on DriveLMM-o1 are derived from an independent benchmark with fixed ground-truth answers and reasoning annotations that were not used in reward computation. To further address concerns of circularity, we have included an additional ablation in the revised experiments section that removes the grounding reward while keeping the iMCoT structure, showing that performance drops significantly, thus demonstrating the reward's contribution beyond the internal definition. We also emphasize that the trustworthiness metrics include multi-hop reasoning accuracy not directly optimized by the consistency reward. revision: partial

  3. Referee: [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.

    Authors: We acknowledge the importance of these statistical controls and ablations. In the revised version, we report results averaged over three independent runs with different random seeds, include standard deviation, and perform paired t-tests to establish statistical significance of the improvements. Additionally, we have added a new ablation table that isolates the contribution of the interleaved CoT mechanism from the full two-stage RL pipeline with Clip-GRPO, confirming that both elements are necessary for the reported gains. revision: yes
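The statistical control the rebuttal describes is cheap to compute. A generic sketch of the paired t statistic over matched seeds (not the authors' analysis code; with three seeds there are only 2 degrees of freedom, so this is weak evidence at best):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic for per-seed scores of two systems:
    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-run score
    differences. Assumes the runs are not all identical (stdev > 0)."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
```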

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical training procedure (two-stage RL with Clip-GRPO) whose grounding reward is explicitly defined as enforcing cross-modal consistency; the reported gains are measured on the external DriveLMM-o1 benchmark via reasoning score and final-answer accuracy. No derivation chain reduces any claimed result to its inputs by construction, no self-citation is used as a load-bearing uniqueness theorem, and no fitted parameter is relabeled as an independent prediction. The method is therefore self-contained as a standard RL pipeline evaluated on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on standard VLM and RL assumptions plus newly introduced training components; no external formal verification or independent benchmarks are referenced in the abstract.

axioms (1)
  • domain assumption: End-to-end joint optimization of perception and reasoning is feasible via RL in VLMs.
    Invoked to justify the two-stage reinforcement learning pipeline.
invented entities (1)
  • Clip-GRPO algorithm (no independent evidence)
    purpose: annotation-free, process-based grounding reward enforcing cross-modal consistency.
    A newly introduced component whose stability is asserted but not externally validated in the abstract.

pith-pipeline@v0.9.0 · 5620 in / 1238 out tokens · 40581 ms · 2026-05-16T22:29:53.449480+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 9 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 29915–29926, 2025. 1

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 6, 7

  4. [4]

    Spatialbot: Pre- cise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9490–9498. IEEE, 2025. 7, 8

  5. [5]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 7

  6. [6]

    Spa- tialrgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37: 135062–135093, 2024

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spa- tialrgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37: 135062–135093, 2024. 8

  7. [7]

    Visual thoughts: A unified perspective of understanding multimodal chain-of-thought.arXiv preprint arXiv:2505.15510, 2025

    Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, et al. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought.arXiv preprint arXiv:2505.15510, 2025. 1

  8. [8]

    Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving

    Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K Singh, Sid- dharth Srivastava, Krishna Murthy Jatavallabhula, and K Mad- hava Krishna. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. In2024 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 16345–16352. IEEE, 2024. 3

  9. [9]

    Retrieval-based inter- leaved visual chain-of-thought in real-world driving scenarios

    Charles Corbi`ere, Simon Roburin, Syrielle Montariol, An- toine Bosselut, and Alexandre Alahi. Retrieval-based inter- leaved visual chain-of-thought in real-world driving scenarios. arXiv preprint arXiv:2501.04671, 2025. 3

  10. [10]

    Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904,

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904,

  11. [11]

    Drive like a human: Rethinking autonomous driving with large language models

    Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024. 3

  12. [12]

    Surds: Benchmarking spatial understand- ing and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

    Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understand- ing and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024. 6, 7, 1, 2

  13. [13]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wen- meng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 14281–14290,

  14. [14]

    Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 1

  15. [15]

    Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025

    Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025. 1, 2, 6, 7

  16. [16]

    Gpt-4o: The cutting-edge advance- ment in multimodal llm

    R Islam and OM Moushi. Gpt-4o: The cutting-edge advance- ment in multimodal llm. techrxiv, 2024. 2, 7

  17. [17]

    Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025

    Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025. 1

  18. [18]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 1, 3

  19. [19]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Represen- tations, 2023. 3

  20. [20]

    Ovis: Structural embed- ding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embed- ding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024. 7

  21. [21]

    Multi-task collaborative network for joint referring expression comprehension and segmentation

    Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task collaborative network for joint referring expression comprehension and segmentation. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 10034– 10043, 2020. 3

  22. [22]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024. 3

  23. [23]

    Reason2drive: Towards 9 interpretable and chain-based reasoning for autonomous driv- ing

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jian- hua Han, Hang Xu, and Li Zhang. Reason2drive: Towards 9 interpretable and chain-based reasoning for autonomous driv- ing. InEuropean Conference on Computer Vision, pages 292–308. Springer, 2024. 3

  24. [24]

    Pfau, J., Merrill, W., and Bowman, S

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting llms for efficient parallel generation.arXiv preprint arXiv:2307.15337, 2023. 3

  25. [25]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 3

  26. [26]

    Mutual reasoning makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,

    Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,

  27. [27]

    Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision- language models for autonomous driving.arXiv preprint arXiv:2505.15298, 2025

    Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision- language models for autonomous driving.arXiv preprint arXiv:2505.15298, 2025. 1, 2, 3, 4, 6, 7

  28. [28]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024. 3

  29. [29]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

  30. [30]

    A re- duction of imitation learning and structured prediction to no-regret online learning

    St´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth in- ternational conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 1

  31. [31]

    Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 1

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 5

  33. [33]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 3

  34. [34]

    Driving with graph visual question answering

    C Sima, K Renz, K Chitta, L Chen, H Zhang, C Xie, P Luo, A Geiger, and H Drivelm Li. Driving with graph visual question answering. arxiv 2023.arXiv preprint arXiv:2312.14150,

  35. [35]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 3

  36. [36]

    Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024

    Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024. 1

  37. [37]

    Mm- verify: Enhancing multimodal reasoning with chain-of- thought verification.arXiv preprint arXiv:2502.13383, 2025

    Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tian- peng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. Mm- verify: Enhancing multimodal reasoning with chain-of- thought verification.arXiv preprint arXiv:2502.13383, 2025. 3

  38. [38]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking mul- timodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 8

  39. [39]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms, 2025

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, and Salman Khan. Llamav-o1: Rethinking step-by-step visual reasoning in llms, 2025. 3

  40. [40]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 24290–24315, 2025. 7

  41. [41]

    Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025. 1

  42. [42]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in lan- guage models.arXiv preprint arXiv:2203.11171, 2022. 3

  43. [43]

    Chain-of- thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824– 24837, 2022. 1, 3

  44. [44]

    V?: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 3

  45. [45]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 2087–2098,

  46. [46]

    Llava-cot: Let vision language models reason step-by-step, 2025

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. 3 10

  47. [47]

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters.

  48. [48]

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.

  49. [49]

    Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, 36:17773–17794.

  50. [50]

    Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, and Jiaya Jia. Prompt highlighter: Interactive control for multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13215–13224, 2024.

  51. [51]

    Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.

  52. [52]

    Haohan Zheng and Zhenguo Zhang. Modality bias in LVLMs: Analyzing and mitigating object hallucination via attention lens. arXiv preprint arXiv:2508.02419, 2025.

  53. [53]

    Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. Seqtr: A simple yet universal network for visual grounding. In European Conference on Computer Vision, pages 598–615. Springer, 2022.

    Faithfulness-Step (1-10): Measures how well the model's reasoning steps align with the ground truth.
    • 9-10: All steps correctly match or closely reflect the reference.
    • 7-8: Most steps align, with minor deviations.
    • 5-6: Some steps align, but several are incorrect or missing.
    • 3-4: Few steps align; most are inaccurate or missing.
    • 1-2: Majority of steps are incorrect.

    Informativeness-Step (1-10): Measures completeness of reasoning.
    • 9-10: Captures almost all critical information.
    • 7-8: Covers most key points, with minor omissions.
    • 5-6: Missing significant details.
    • 3-4: Only partial reasoning present.
    • 1-2: Poor extraction of relevant reasoning.

    Risk Assessment Accuracy (1-10): Evaluates if the model correctly prioritizes high-risk objects or scenarios.
    • 9-10: Correctly identifies and prioritizes key dangers.
    • 7-8: Mostly accurate, with minor misprioritizations.
    • 5-6: Some important risks are overlooked.
    • 3-4: Significant misjudgments in risk prioritization.
    • 1-2: Misidentifies key risks or misses th…

    Traffic Rule Adherence (1-10): Evaluates whether the response follows traffic laws and driving best practices.
    • 9-10: Fully compliant with legal and safe driving practices.
    • 7-8: Minor deviations, but mostly correct.
    • 5-6: Some inaccuracies in legal/safe driving recommendations.
    • 3-4: Several rule violations or unsafe suggestions.
    • 1-2: Promotes highly un…

    Scene Awareness & Object Understanding (1-10): Measures how well the response interprets objects, their positions, and actions.
    • 9-10: Clearly understands all relevant objects and their relationships.
    • 7-8: Minor misinterpretations but mostly correct.
    • 5-6: Some key objects misunderstood or ignored.
    • 3-4: Many errors in object recognition and reasoning.
    • 1-2: …

    Repetition-Token (1-10): Identifies unnecessary repetition in reasoning.
    • 9-10: No redundancy, very concise.
    • 7-8: Minor repetition but still clear.
    • 5-6: Noticeable redundancy.
    • 3-4: Frequent repetition that disrupts reasoning.
    • 1-2: Excessive redundancy, making reasoning unclear.

    Hallucination (1-10): Detects irrelevant or invented reasoning steps not aligned with ground truth.
    • 9-10: No hallucinations, all reasoning is grounded.
    • 7-8: One or two minor hallucinations.
    • 5-6: Some fabricated details.
    • 3-4: Frequent hallucinations.
    • 1-2: Majority of reasoning is hallucinated.

    Semantic Coverage-Step (1-10): Checks if the response fully covers the critical reasoning elements.
    • 9-10: Nearly complete semantic coverage.
    • 7-8: Good coverage, some minor omissions.
    • 5-6: Partial coverage with key gaps.
    • 3-4: Major gaps in reasoning.
    • 1-2: Very poor semantic coverage.

    Commonsense Reasoning (1-10): Assesses the use of intuitive driving logic in reasoning.
    • 9-10: Displays strong commonsense understanding.
    • 7-8: Mostly correct, with minor gaps.
    • 5-6: Some commonsense errors.
    • 3-4: Frequent commonsense mistakes.
    • 1-2: Lacks basic driving commonsense.

    Missing Step (1-10): Evaluates if any necessary reasoning steps are missing.
    • 9-10: No critical steps missing.
    • 7-8: Minor missing steps, but answer is mostly intact.
    • 5-6: Some important steps missing.
    • 3-4: Many critical reasoning gaps.
    • 1-2: Response is highly incomplete.

    Relevance (1-10): Measures how well the response is specific to the given scenario and ground truth.
    • 9-10: Highly specific and directly relevant to the driving scenario. Captures critical elements precisely, with no unnecessary generalization.
    • 7-8: Mostly relevant, but some minor parts may be overly generic or slightly off-focus.
    • 5-6: Somewhat relevan…

    Missing Details (1-10): Evaluates the extent to which critical information is missing from the response, impacting the reasoning quality.
    • 9-10: No significant details are missing; response is comprehensive and complete.
    • 7-8: Covers most important details, with minor omissions that do not severely impact reasoning.
    • 5-6: Some essential details are mis…
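Taken together, the criteria above define a twelve-dimension 1-10 rubric that an LLM judge scores per response. A minimal sketch of how such per-criterion scores might be validated and pooled into one reasoning-quality number — the equal-weight mean and all identifier names here are assumptions for illustration, not an aggregation specified by the paper:

```python
# Hypothetical aggregation of the twelve rubric criteria listed above.
# Each criterion is assumed to arrive as an integer 1-10 from the judge;
# the equal-weight mean is an illustrative choice, not the paper's scheme.

CRITERIA = [
    "faithfulness_step",
    "informativeness_step",
    "risk_assessment_accuracy",
    "traffic_rule_adherence",
    "scene_awareness",
    "repetition_token",
    "hallucination",
    "semantic_coverage_step",
    "commonsense_reasoning",
    "missing_step",
    "relevance",
    "missing_details",
]

def aggregate(scores: dict) -> float:
    """Validate per-criterion scores (1-10) and return their mean."""
    for name in CRITERIA:
        s = scores[name]
        if not 1 <= s <= 10:
            raise ValueError(f"{name} score {s} outside 1-10 scale")
    return sum(scores[n] for n in CRITERIA) / len(CRITERIA)

# Example: a response scored 8 on every criterion.
example = {name: 8 for name in CRITERIA}
print(aggregate(example))  # -> 8.0
```

A weighted mean (e.g. up-weighting the hallucination and traffic-rule axes) would drop in the same place if safety-critical criteria should dominate the overall score.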