OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Pith reviewed 2026-05-16 22:29 UTC · model grok-4.3
The pith
Reinforcement learning lets vision-language models interleave visual attention with reasoning steps to reduce hallucinations in autonomous driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniDrive-R1 introduces an end-to-end VLM that unifies perception and reasoning through an interleaved multi-modal chain-of-thought mechanism. Its reinforcement-driven visual grounding lets the model autonomously zoom in on critical image regions. This is realized via a pure two-stage RL pipeline and the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward that enforces real-time cross-modal consistency between visual focus and textual reasoning.
What carries the argument
The interleaved multi-modal chain-of-thought (iMCoT) mechanism combined with the Clip-GRPO algorithm, which supplies a process-based grounding reward that ties visual attention directly to reasoning steps without dense labels.
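The grounding reward that carries this argument is not spelled out in this review. A minimal sketch of one plausible shape, assuming CLIP-style image and text encoders that share an embedding space; the function names and the per-step averaging are illustrative, not the authors' definition:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def grounding_reward(step_regions, step_texts, image_encoder, text_encoder):
    """Process-based reward: mean image-text agreement across reasoning steps.

    step_regions: image crops the model attended to, one per reasoning step
    step_texts:   the textual reasoning emitted at each step
    The encoders are assumed to map into a shared CLIP-style space.
    """
    sims = [
        cosine(image_encoder(region), text_encoder(text))
        for region, text in zip(step_regions, step_texts)
    ]
    return sum(sims) / len(sims) if sims else 0.0
```

Because a reward of this form is computed per reasoning step rather than per final answer, it supervises the process itself, which is consistent with the "annotation-free, process-based" description: no box labels are needed, only agreement between each visual focus and the text it accompanies.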
If this is right
- Perception and reasoning stages can be jointly optimized in a single end-to-end training loop.
- Visual grounding becomes possible without collecting dense localization annotations for every training example.
- Real-time cross-modal consistency can be enforced during inference without calling external tools.
- Reasoning quality and final answer accuracy both rise substantially on driving-specific benchmarks.
Where Pith is reading between the lines
- The same reinforcement grounding approach could be tested on other safety-critical domains such as medical image interpretation where hallucinations are costly.
- If the attention maps remain stable under distribution shift, the method may support deployment in vehicles where scenes change rapidly.
- Process-based rewards of this form might replace outcome-only supervision in broader multimodal reasoning tasks beyond driving.
Load-bearing premise
The measured improvements stem from the new interleaved grounding reward rather than from unstated differences in training data volume, compute budget, or baseline implementation details.
What would settle it
Run the trained model on a fresh driving dataset containing scenes with known hallucinated objects and check whether the model's generated attention regions during reasoning actually cover the objects referenced in its final answer.
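One way to operationalize that check is a box-coverage test over the model's generated attention regions. Everything below is a sketch: the axis-aligned box format, the 0.5 IoU threshold, and the function names are assumptions, not the paper's protocol.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def coverage_check(attended_boxes, referenced_boxes, thresh=0.5):
    """For each object referenced in the final answer, report whether some
    attention region generated during reasoning covers it (IoU >= thresh)."""
    return [
        any(iou(ref, att) >= thresh for att in attended_boxes)
        for ref in referenced_boxes
    ]
```

An answer that references an object with no covering attention region (a False entry) would be exactly the ungrounded, potentially hallucinated case the experiment is meant to surface.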
Original abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OmniDrive-R1, an end-to-end vision-language model for autonomous driving that unifies perception and reasoning via an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Its central innovation is a two-stage reinforcement learning pipeline employing the Clip-GRPO algorithm, which supplies an annotation-free, process-based grounding reward enforcing real-time cross-modal consistency to reduce object hallucination. On the DriveLMM-o1 benchmark the paper reports that OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and final-answer accuracy from 37.81% to 73.62% relative to the Qwen2.5VL-7B baseline.
Significance. If the performance deltas can be isolated to the iMCoT and Clip-GRPO components, the work would constitute a meaningful step toward trustworthy multi-modal reasoning in safety-critical settings. The annotation-free grounding reward and end-to-end optimization address two documented shortcomings of prior multi-modal CoT methods; successful validation would therefore be of clear interest to the autonomous-driving and reliable-VLM communities.
major comments (3)
- [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.
- [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.
- [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.
minor comments (1)
- [Methods] The acronym Clip-GRPO is introduced without expansion or explicit relation to standard GRPO/PPO variants; a brief derivation or pseudocode in the methods section would improve clarity.
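For orientation while that derivation is missing: in standard GRPO (DeepSeekMath, Shao et al.), each prompt is sampled several times and every completion's reward is normalized against its own group, with no learned value critic. The sketch below shows that normalization, plus one hypothetical way a process-based grounding term could enter the reward; neither the weighting nor the name `clip_grpo_reward` comes from the paper.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: z-score each completion's reward against the
    group of completions sampled for the same prompt (no value critic)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

def clip_grpo_reward(outcome_reward, grounding_reward, lam=0.5):
    # Hypothetical total reward: outcome term plus a weighted process-based
    # grounding term. The paper's actual combination is not specified here.
    return outcome_reward + lam * grounding_reward
```

The advantages then weight the usual clipped policy-gradient objective; the open question the referee raises is precisely how the grounding term is computed and scheduled inside this loop.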
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below.
Point-by-point responses
-
Referee: [Abstract] The reported gains (reasoning score 51.77%→80.35%, final-answer accuracy 37.81%→73.62%) are presented without any statement that the Qwen2.5VL-7B baseline received identical data volume, number of RL steps, optimizer schedule, or total compute as OmniDrive-R1. This information is load-bearing for the claim that improvements arise from the Clip-GRPO grounding reward rather than unstated training differences.
Authors: We agree that specifying the training configuration for the baseline is essential to isolate the effect of our proposed components. In the revised manuscript, we have added explicit statements in both the abstract and the experiments section clarifying that the Qwen2.5VL-7B baseline was trained with the same data volume, number of RL steps, and optimizer schedule. The total compute was matched as closely as possible given hardware constraints, and we note that the primary difference lies in the application of the Clip-GRPO algorithm versus standard RL. This supports our claim that the gains stem from the annotation-free grounding reward. revision: yes
-
Referee: [Abstract, Experiments] The grounding reward is defined internally via cross-modal consistency enforced during training. No external benchmark or ablation is described that would demonstrate the accuracy gains are independent of this internal formulation, leaving open the possibility of circularity between the reward and the reported trustworthiness metric.
Authors: While the grounding reward is computed internally based on cross-modal consistency, the evaluation metrics on DriveLMM-o1 are derived from an independent benchmark with fixed ground-truth answers and reasoning annotations that were not used in reward computation. To further address concerns of circularity, we have included an additional ablation in the revised experiments section that removes the grounding reward while keeping the iMCoT structure, showing that performance drops significantly, thus demonstrating the reward's contribution beyond the internal definition. We also emphasize that the trustworthiness metrics include multi-hop reasoning accuracy not directly optimized by the consistency reward. revision: partial
-
Referee: [Experiments] No statistical significance tests, run-to-run variance, or ablation isolating the interleaved CoT stage from the two-stage RL pipeline are reported. These controls are required to substantiate that the observed deltas are attributable to the proposed mechanism.
Authors: We acknowledge the importance of these statistical controls and ablations. In the revised version, we report results averaged over three independent runs with different random seeds, include standard deviation, and perform paired t-tests to establish statistical significance of the improvements. Additionally, we have added a new ablation table that isolates the contribution of the interleaved CoT mechanism from the full two-stage RL pipeline with Clip-GRPO, confirming that both elements are necessary for the reported gains. revision: yes
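The promised three-seed significance test reduces to a plain paired t statistic over per-seed scores. A stdlib sketch, assuming the two systems share seeds so the runs form matched pairs (and that the per-seed differences are not all identical):

```python
import math

def paired_t(xs, ys):
    """Paired t statistic for matched runs, e.g. per-seed scores of two
    systems. Returns (t, degrees of freedom). Assumes the per-seed
    differences have nonzero sample standard deviation."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean / (sd / math.sqrt(n)), n - 1
```

With n = 3 runs there are only 2 degrees of freedom, so the two-sided critical value at p = 0.05 is large (about 4.30); reporting per-seed variance alongside the deltas, as the authors promise, is at least as informative as the test itself.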
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical training procedure (two-stage RL with Clip-GRPO) whose grounding reward is explicitly defined as enforcing cross-modal consistency; the reported gains are measured on the external DriveLMM-o1 benchmark via reasoning score and final-answer accuracy. No derivation chain reduces any claimed result to its inputs by construction, no self-citation is used as a load-bearing uniqueness theorem, and no fitted parameter is relabeled as an independent prediction. The method is therefore self-contained as a standard RL pipeline evaluated on held-out data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: end-to-end joint optimization of perception and reasoning is feasible via RL in VLMs.
invented entities (1)
- Clip-GRPO algorithm (no independent evidence)
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29915–29926, 2025.
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025.
- [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [6] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.
- [7] Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, et al. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. arXiv preprint arXiv:2505.15510, 2025.
- [8] Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, and K Madhava Krishna. Talk2BEV: Language-enhanced bird's-eye view maps for autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16345–16352. IEEE, 2024.
- [9] Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios. arXiv preprint arXiv:2501.04671, 2025.
- [10] Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like MLLM. arXiv preprint arXiv:2501.01904, 2025.
- [11] Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024.
- [12] Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. SURDS: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. arXiv preprint arXiv:2411.13112, 2024.
- [13] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
- [14] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024.
- [15] Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. DriveLMM-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. arXiv preprint arXiv:2503.10621, 2025.
- [16] R Islam and OM Moushi. GPT-4o: The cutting-edge advancement in multimodal LLM. TechRxiv, 2024.
- [17] Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. VLM-R3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought. arXiv preprint arXiv:2505.16192, 2025.
- [18] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025.
- [19] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [20] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024.
- [21] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10034–10043, 2020.
- [22] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.
- [23] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024.
- [24] Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting LLMs for efficient parallel generation. arXiv preprint arXiv:2307.15337, 2023.
- [25] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- [26] Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller LLMs stronger problem-solvers. arXiv preprint arXiv:2408.06195, 2024.
- [27] Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.15298, 2025.
- [28] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024.
- [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [30] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
- [31] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.
- [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [33] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
- [34] C Sima, K Renz, K Chitta, L Chen, H Zhang, C Xie, P Luo, A Geiger, and H Li. DriveLM: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
- [35] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- [36] Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862, 2024.
- [37] Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. MM-Verify: Enhancing multimodal reasoning with chain-of-thought verification. arXiv preprint arXiv:2502.13383, 2025.
- [38] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [39] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, and Salman Khan. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs, 2025.
- [40] Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, 2025.
- [41] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025.
- [42] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- [43] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [44] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.
- [45] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025.
- [46] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step, 2025.
- [47] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters.
- [48] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.
- [49] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, 36:17773–17794, 2023.
- [50] Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, and Jiaya Jia. Prompt Highlighter: Interactive control for multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13215–13224, 2024.
- [51] Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.
- [52] Haohan Zheng and Zhenguo Zhang. Modality bias in LVLMs: Analyzing and mitigating object hallucination via attention lens. arXiv preprint arXiv:2508.02419, 2025.
- [53] Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. SeqTR: A simple yet universal network for visual grounding. In European Conference on Computer Vision, pages 598–615. Springer, 2022.
DriveLMM-o1 evaluation rubric (extracted; each dimension scored 1-10; some band descriptions truncated in the source)
- Faithfulness-Step: how well the model's reasoning steps align with the ground truth. 9-10: all steps correctly match or closely reflect the reference; 7-8: most steps align, with minor deviations; 5-6: some steps align, but several are incorrect or missing; 3-4: few steps align, most are inaccurate or missing; 1-2: majority of steps are incorrect.
- Informativeness-Step: completeness of reasoning. 9-10: captures almost all critical information; 7-8: covers most key points, with minor omissions; 5-6: missing significant details; 3-4: only partial reasoning present; 1-2: poor extraction of relevant reasoning.
- Risk Assessment Accuracy: whether the model correctly prioritizes high-risk objects or scenarios. 9-10: correctly identifies and prioritizes key dangers; 7-8: mostly accurate, with minor misprioritizations; 5-6: some important risks are overlooked; 3-4: significant misjudgments in risk prioritization; 1-2: misidentifies key risks or misses th…
- Traffic Rule Adherence: whether the response follows traffic laws and driving best practices. 9-10: fully compliant with legal and safe driving practices; 7-8: minor deviations, but mostly correct; 5-6: some inaccuracies in legal/safe driving recommendations; 3-4: several rule violations or unsafe suggestions; 1-2: promotes highly un…
- Scene Awareness & Object Understanding: how well the response interprets objects, their positions, and actions. 9-10: clearly understands all relevant objects and their relationships; 7-8: minor misinterpretations but mostly correct; 5-6: some key objects misunderstood or ignored; 3-4: many errors in object recognition and reasoning; 1-2…
- Repetition-Token: unnecessary repetition in reasoning. 9-10: no redundancy, very concise; 7-8: minor repetition but still clear; 5-6: noticeable redundancy; 3-4: frequent repetition that disrupts reasoning; 1-2: excessive redundancy, making reasoning unclear.
- Hallucination: irrelevant or invented reasoning steps not aligned with ground truth. 9-10: no hallucinations, all reasoning is grounded; 7-8: one or two minor hallucinations; 5-6: some fabricated details; 3-4: frequent hallucinations; 1-2: majority of reasoning is hallucinated.
- Semantic Coverage-Step: whether the response fully covers the critical reasoning elements. 9-10: nearly complete semantic coverage; 7-8: good coverage, some minor omissions; 5-6: partial coverage with key gaps; 3-4: major gaps in reasoning; 1-2: very poor semantic coverage.
- Commonsense Reasoning: use of intuitive driving logic in reasoning. 9-10: displays strong commonsense understanding; 7-8: mostly correct, with minor gaps; 5-6: some commonsense errors; 3-4: frequent commonsense mistakes; 1-2: lacks basic driving commonsense.
- Missing Step: whether any necessary reasoning steps are missing. 9-10: no critical steps missing; 7-8: minor missing steps, but answer is mostly intact; 5-6: some important steps missing; 3-4: many critical reasoning gaps; 1-2: response is highly incomplete.
- Relevance: how specific the response is to the given scenario and ground truth. 9-10: highly specific and directly relevant to the driving scenario, capturing critical elements precisely with no unnecessary generalization; 7-8: mostly relevant, but some minor parts may be overly generic or slightly off-focus; 5-6: somewhat relevan…
- Missing Details: the extent to which critical information is missing from the response, impacting reasoning quality. 9-10: no significant details are missing, response is comprehensive and complete; 7-8: covers most important details, with minor omissions that do not severely impact reasoning; 5-6: some essential details are mis…
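If the benchmark's overall reasoning score is an unweighted mean of per-dimension rubric scores rescaled to a percentage (an assumption; the aggregation rule is not stated in this extract), the computation would be:

```python
def overall_score(rubric_scores):
    # rubric_scores: dimension name -> score on a 1-10 scale.
    # Assumed aggregation: unweighted mean, rescaled to a percentage.
    values = list(rubric_scores.values())
    return 100.0 * sum(values) / (10.0 * len(values))
```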