VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Burhan Yaman; Danhua Guo; Fei Wu; Thach Nguyen; Tom Lampo

arxiv: 2606.12706 · v1 · pith:QI4SZV2Onew · submitted 2026-06-10 · 💻 cs.CV

Pith reviewed 2026-06-27 09:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords VLADriveBench · chain-of-thought · vision-language-action · autonomous driving · causal evaluation · observational metrics · CoT intervention

The pith

VLADriveBench shows observational alignment and causal influence can diverge sharply in VLA driving models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VLADriveBench to test whether chain-of-thought reasoning generated by vision-language-action models actually shapes their driving trajectories or is merely correlated with them. Current evaluations only measure final trajectory quality and ignore whether the reasoning mentions relevant elements, avoids contradictions, or influences the chosen action. The new framework adds an intervention step that alters the CoT and measures resulting changes in actions, revealing cases where strong observational scores coincide with reasoning that has no causal role and other cases where lower scores coincide with reasoning that does control behavior.

Core claim

VLADriveBench pairs four observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol. When run on three models, the two views conflict: one model achieves the highest observational alignment yet its CoT proves epiphenomenal, while another scores lower observationally yet its CoT exerts strong causal control, with visual salience modulating how much the reasoning affects the final action.
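As a rough illustration of what observational metrics of this kind might compute, the sketch below scores a single CoT against a scene annotation. The abstract does not define the metrics, so the definitions, the function names (`mentioning_rate`, `hallucination_rate`, `action_alignment`), and the entity labels are all assumptions made for this example, not the paper's formulas.

```python
# Hypothetical definitions: the paper's actual metric formulas are not
# given in the abstract, so these are illustrative assumptions only.

def mentioning_rate(cot_entities, relevant_entities):
    """Fraction of decision-relevant scene entities that the CoT mentions."""
    relevant = set(relevant_entities)
    if not relevant:
        return 1.0  # nothing to mention, so nothing was missed
    return len(relevant & set(cot_entities)) / len(relevant)


def hallucination_rate(cot_entities, scene_entities):
    """Fraction of CoT-mentioned entities that are absent from the scene."""
    mentioned = set(cot_entities)
    if not mentioned:
        return 0.0  # an empty CoT cannot hallucinate
    return len(mentioned - set(scene_entities)) / len(mentioned)


def action_alignment(stated_action, executed_action):
    """1.0 if the action named in the CoT matches the executed action."""
    return 1.0 if stated_action == executed_action else 0.0


# Toy annotation with made-up labels.
cot = ["pedestrian", "traffic_light", "cyclist"]
scene = ["pedestrian", "traffic_light", "parked_car"]
relevant = ["pedestrian", "traffic_light"]

print(mentioning_rate(cot, relevant))
print(hallucination_rate(cot, scene))
print(action_alignment("stop", "stop"))
```

All three scores are computed from the generated text and the executed action alone, which is why a high score here says nothing about whether the text influenced the action.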

What carries the argument

The CoT intervention protocol, which modifies generated reasoning and records consequent shifts in driving actions, used together with observational checks for relevance and consistency.
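One way to picture an intervention measurement of this kind is sketched below: run the same observation through a policy twice, once with the original CoT and once with an edited CoT, and measure how far the trajectory moves. The policy interface, the toy policies, and the displacement measure are assumptions for illustration; the abstract does not say how the paper edits the CoT or quantifies the shift.

```python
import math

# Assumed interface: policy(observation, cot_text) -> list of (x, y) waypoints.
# Real VLA models expose nothing this simple; this is a stand-in.


def mean_displacement(traj_a, traj_b):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def intervention_effect(policy, observation, original_cot, edited_cot):
    """Trajectory shift caused by swapping the CoT while holding the input fixed."""
    baseline = policy(observation, original_cot)
    intervened = policy(observation, edited_cot)
    return mean_displacement(baseline, intervened)


def ignores_cot(observation, cot):
    """Toy policy whose output never depends on the CoT (epiphenomenal case)."""
    return [(0.0, float(t)) for t in range(1, 4)]


def follows_cot(observation, cot):
    """Toy policy that halts whenever the CoT contains the word 'stop'."""
    speed = 0.0 if "stop" in cot else 1.0
    return [(0.0, speed * t) for t in range(1, 4)]


obs = None  # placeholder; the toy policies do not read the observation
print(intervention_effect(ignores_cot, obs, "keep lane", "stop for pedestrian"))  # 0.0
print(intervention_effect(follows_cot, obs, "keep lane", "stop for pedestrian"))  # 2.0
```

A shift of zero across many edits is what an epiphenomenal CoT would look like under this measure, while a large shift indicates that the text is being read by whatever produces the action.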

If this is right

Trajectory-quality benchmarks alone can certify models whose reasoning does not affect their outputs.
Visual salience acts as a gate on whether generated reasoning influences actions.
Complementary causal checks are required to determine whether a model's explanations are functionally operative.
Architecture and training differences can produce opposite relationships between observational scores and causal impact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety arguments for VLA deployment would need to verify causal CoT influence rather than rely on alignment metrics.
Training objectives that strengthen the link between salient visual features and reasoning steps could increase causal consistency.
The same intervention approach could be applied to other embodied VLA tasks to test whether reasoning is epiphenomenal.

Load-bearing premise

Modifying the chain-of-thought text isolates its causal effect on actions, and that measured effect is not confounded by differences in model architecture, training data, or implementation details.

What would settle it

A controlled test in which editing the CoT of the high-observational model produces no measurable change in its driving trajectories, or editing the CoT of the lower-scoring model produces large trajectory changes only when visual salience is high.
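A test of this shape could be summarised by grouping per-scenario intervention effects by visual salience and comparing the groups. The sketch below assumes each scenario has a salience score in [0, 1] and a scalar intervention effect; the record layout, the 0.5 threshold, and the numbers are invented for the example and do not come from the paper.

```python
from statistics import mean


def effect_by_salience(records, threshold=0.5):
    """Mean intervention effect in low- and high-salience bins.

    records: iterable of (salience, effect) pairs (hypothetical layout).
    Returns None for a bin with no scenarios.
    """
    low = [effect for salience, effect in records if salience < threshold]
    high = [effect for salience, effect in records if salience >= threshold]
    return {
        "low": mean(low) if low else None,
        "high": mean(high) if high else None,
    }


# Made-up values, purely to show the shape of the comparison.
records = [(0.1, 0.0), (0.3, 0.5), (0.7, 2.0), (0.9, 3.0)]
print(effect_by_salience(records))
```

A single threshold is a crude choice; a real analysis would more likely report the effect across several salience levels and attach uncertainty estimates to each.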

Original abstract

Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VLADriveBench adds an interventional check on top of observational metrics for CoT in driving VLAs, but the protocol's validity is not established enough to support the divergence claims.

The main takeaway is that this paper introduces VLADriveBench to evaluate whether chain-of-thought reasoning in vision-language-action models for autonomous driving actually influences the output actions or is just decorative. The authors apply it to three models and report that observational alignment and causal intervention results can point in opposite directions, with ORION looking strong on the first but weak on the second.

What is new is the explicit pairing of standard observational metrics (mentioning, hallucination, contradiction, alignment) with an intervention protocol meant to test causality. That dual view is a direct response to a practical problem in safety-critical systems, where you cannot rely on surface-level consistency alone.

The paper does a clear job framing why trajectory quality alone is insufficient and showing concrete model differences that illustrate the gap. The visual salience point is a useful secondary observation.

The soft spot is the intervention protocol itself. No details appear on how CoT tokens are edited or masked, what control conditions are used, or how architecture and training differences are ruled out as confounds. Without that validation, the reported divergence between ORION and Alpamayo could stem from implementation artifacts rather than true causal effects. The abstract supplies no statistical tests or robustness checks either.

This is for researchers working on VLA models in autonomous driving who need better ways to audit reasoning. A reader focused on benchmarks for multimodal agents would find the framing useful even if the execution is preliminary.

It deserves peer review because the underlying question matters for deployment and the basic framework is a reasonable step forward, though the protocol section will need substantial expansion and evidence before the claims can be trusted.

Referee Report

2 major / 1 minor

Summary. The paper introduces VLADriveBench, a benchmark combining observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to evaluate whether chain-of-thought reasoning in vision-language-action (VLA) models is relevant and causally connected to driving actions. Applied to three models across two architectures, it reports that observational and causal analyses diverge sharply, with ORION scoring highest on observational alignment but having epiphenomenal CoT, while Alpamayo v1.5 shows lower observational scores but strongly causal CoT, modulated by visual salience.

Significance. If the intervention protocol can be shown to isolate causal effects without confounds, the result would usefully demonstrate that observational metrics alone are insufficient for assessing CoT utility in VLA driving models and could inform safer model design. The work addresses a clear gap in existing trajectory-only benchmarks.

major comments (2)

[CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.
[Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.
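To make the request for statistical support concrete, one simple option is a percentile bootstrap confidence interval on the mean intervention effect for each model. This is a generic technique offered as a sketch, not something the paper is reported to use; the function name, defaults, and sample values are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-scenario intervention effects for one model.
effects = [0.0, 0.1, 2.0, 2.2, 1.9, 0.2]
print(bootstrap_ci(effects))
```

An interval that excludes zero for one model and sits at zero for another would give the "causal versus epiphenomenal" contrast a quantitative footing, provided the scenarios are comparable across models.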

minor comments (1)

[Abstract] The abstract would benefit from a brief sentence on the number of scenarios or trajectories used in the benchmark evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our paper. We address each major comment below and indicate the revisions we plan to make.

Referee: [CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.

Authors: We acknowledge the referee's concern regarding the need for validation of the CoT intervention protocol to ensure it isolates causal effects. The manuscript describes the protocol but lacks explicit validation experiments and detailed controls. In the revised version, we will expand the methods section to include descriptions of token-editing mechanics, control conditions, and architecture-specific considerations. We will also add a discussion of potential confounds from architecture and training data differences. This addresses the point by providing more transparency, though full empirical validation may require additional experiments beyond the current scope.

Revision: partial

Referee: [Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.

Authors: We agree that the absence of detailed metric definitions and implementation details in the abstract and results sections hinders evaluation. The revised manuscript will incorporate precise definitions for the observational metrics (mentioning, hallucination, contradiction, action alignment), specify the statistical tests used, and provide implementation details for the intervention outcomes. These changes will be made in the methods and results sections, with a brief mention in the abstract.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark without derivation chain

The paper introduces VLADriveBench as an empirical evaluation framework combining observational metrics and a CoT intervention protocol. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided abstract or description. The central claims rest on applying the benchmark to existing models rather than any self-referential construction or reduction of results to inputs by definition. This matches the default expectation for non-derivational empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified.
