
arxiv: 2606.12706 · v1 · pith:QI4SZV2Onew · submitted 2026-06-10 · 💻 cs.CV

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Pith reviewed 2026-06-27 09:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords VLADriveBench · chain-of-thought · vision-language-action · autonomous driving · causal evaluation · observational metrics · CoT intervention

The pith

VLADriveBench shows observational alignment and causal influence can diverge sharply in VLA driving models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VLADriveBench to test whether chain-of-thought reasoning generated by vision-language-action models actually shapes their driving trajectories or is merely correlated with them. Current evaluations only measure final trajectory quality and ignore whether the reasoning mentions relevant elements, avoids contradictions, or influences the chosen action. The new framework adds an intervention step that alters the CoT and measures resulting changes in actions, revealing cases where strong observational scores coincide with reasoning that has no causal role and other cases where lower scores coincide with reasoning that does control behavior.

Core claim

VLADriveBench pairs four observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol. When run on three models, the two views conflict: one model achieves the highest observational alignment yet its CoT proves epiphenomenal, while another scores lower observationally yet its CoT exerts strong causal control, with visual salience modulating how much the reasoning affects the final action.
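
To make the four observational metrics concrete, here is a minimal sketch of how they might be scored for one scene. The abstract does not define the metrics, so everything here is an assumption made for illustration: the `CoTSample` fields, the set-overlap formulas, the `EXCLUSIVE_INTENTS` pairs, and the idea that the CoT has already been parsed into entity and intent labels.

```python
from dataclasses import dataclass

# Hypothetical pairs of intents treated as mutually exclusive (not from the paper).
EXCLUSIVE_INTENTS = [{"stop", "accelerate"}, {"turn_left", "turn_right"}]


@dataclass
class CoTSample:
    mentioned: set        # entities the CoT refers to
    relevant: set         # annotated decision-relevant entities in the scene
    present: set          # all entities actually in the scene
    stated_intents: list  # intents asserted in the CoT, in order
    executed_action: str  # discretized label of the output trajectory


def observational_scores(s):
    """Assumed definitions of the four observational metrics for one sample."""
    # Share of relevant entities that the CoT mentions.
    mentioning = len(s.mentioned & s.relevant) / len(s.relevant) if s.relevant else 1.0
    # Share of mentioned entities that are not in the scene at all.
    hallucination = len(s.mentioned - s.present) / len(s.mentioned) if s.mentioned else 0.0
    # 1.0 if the CoT asserts two intents that cannot both hold.
    intents = set(s.stated_intents)
    contradiction = float(any(pair <= intents for pair in EXCLUSIVE_INTENTS))
    # 1.0 if the last stated intent matches the executed action.
    final_intent = s.stated_intents[-1] if s.stated_intents else None
    action_alignment = float(final_intent == s.executed_action)
    return {
        "mentioning": mentioning,
        "hallucination": hallucination,
        "contradiction": contradiction,
        "action_alignment": action_alignment,
    }


sample = CoTSample(
    mentioned={"pedestrian", "traffic_light", "cyclist"},
    relevant={"pedestrian", "traffic_light"},
    present={"pedestrian", "traffic_light", "parked_car"},
    stated_intents=["stop"],
    executed_action="stop",
)
scores = observational_scores(sample)
```

Note that all four scores only compare the text with the scene and the output. None of them changes the text, which is why a model can score well here while its CoT has no effect on the action.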

What carries the argument

The CoT intervention protocol, which modifies generated reasoning and records consequent shifts in driving actions, used together with observational checks for relevance and consistency.
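
The measurement side of such a protocol can be sketched as follows. This is not the paper's implementation: the abstract does not say how the CoT is edited or how action shifts are quantified, so the mean waypoint displacement, the `threshold` value, and the function names below are all hypothetical choices.

```python
import math


def trajectory_shift(baseline, intervened):
    """Mean Euclidean distance between matched waypoints of two trajectories.

    Each trajectory is a sequence of (x, y) points; the result is in the same
    unit as the inputs.
    """
    if not baseline or len(baseline) != len(intervened):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, q) for p, q in zip(baseline, intervened)]
    return sum(distances) / len(distances)


def changed_fraction(pairs, threshold=0.5):
    """Fraction of scenes whose shift exceeds an (assumed) threshold.

    `pairs` holds (baseline, intervened) trajectories, one pair per scene.
    """
    shifts = [trajectory_shift(b, i) for b, i in pairs]
    return sum(shift > threshold for shift in shifts) / len(shifts)


original = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # trajectory with the model's own CoT
after_edit = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # trajectory after the CoT was altered
shift = trajectory_shift(original, after_edit)
```

Under this reading, an epiphenomenal CoT would show shifts near zero across scenes, and a causal CoT would show a large changed fraction.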

If this is right

  • Trajectory-quality benchmarks alone can certify models whose reasoning does not affect their outputs.
  • Visual salience acts as a gate on whether generated reasoning influences actions.
  • Complementary causal checks are required to determine whether a model's explanations are functionally operative.
  • Architecture and training differences can produce opposite relationships between observational scores and causal impact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety arguments for VLA deployment would need to verify causal CoT influence rather than rely on alignment metrics.
  • Training objectives that strengthen the link between salient visual features and reasoning steps could increase causal consistency.
  • The same intervention approach could be applied to other embodied VLA tasks to test whether reasoning is epiphenomenal.

Load-bearing premise

Modifying the chain-of-thought text isolates its causal effect on actions, and that effect is not confounded by differences in model architecture, training data, or implementation details.

What would settle it

A controlled test in which editing the CoT of the model with the highest observational scores produces no measurable change in its driving trajectories, or editing the CoT of the lower-scoring model produces large trajectory changes only when visual salience is high.
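
The salience half of that test amounts to a stratified comparison, which might look like the sketch below. The salience labels, the record layout, and the use of a plain per-group mean are assumptions for illustration; the abstract does not say how salience is measured or binned.

```python
from collections import defaultdict
from statistics import mean


def shift_by_salience(records):
    """Mean intervention-induced trajectory shift per salience group.

    `records` is an iterable of (salience_label, shift) pairs, one per scene.
    """
    groups = defaultdict(list)
    for label, shift in records:
        groups[label].append(shift)
    return {label: mean(values) for label, values in groups.items()}


# Made-up numbers, used only to show the shape of the comparison.
records = [("high", 2.0), ("high", 4.0), ("low", 0.0), ("low", 1.0)]
per_group = shift_by_salience(records)
```

A gap between groups would only suggest gating; it would still need a significance test and a check that scene difficulty does not differ across groups.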

Original abstract

Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VLADriveBench, a benchmark combining observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to evaluate whether chain-of-thought reasoning in vision-language-action (VLA) models is relevant and causally connected to driving actions. Applied to three models across two architectures, it reports that observational and causal analyses diverge sharply, with ORION scoring highest on observational alignment but having epiphenomenal CoT, while Alpamayo v1.5 shows lower observational scores but strongly causal CoT, modulated by visual salience.

Significance. If the intervention protocol can be shown to isolate causal effects without confounds, the result would usefully demonstrate that observational metrics alone are insufficient for assessing CoT utility in VLA driving models and could inform safer model design. The work addresses a clear gap in existing trajectory-only benchmarks.

major comments (2)
  1. [CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.
  2. [Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief sentence on the number of scenarios or trajectories used in the benchmark evaluation.
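
One way the control condition and statistical test requested in the major comments could be operationalized is a paired sign-flip permutation test: for each scene, compare the trajectory shift after a meaning-changing CoT edit with the shift after a meaning-preserving paraphrase. This is a sketch under assumptions, not something the paper reports: the paraphrase control, the one-sided alternative, and the function name are hypothetical.

```python
from itertools import product


def sign_flip_p_value(edit_shifts, control_shifts):
    """One-sided exact sign-flip test on paired per-scene shifts.

    Null hypothesis: the meaning-changing edit moves the trajectory no more
    than the paraphrase control. Enumerates all 2**n sign assignments, so it
    is only practical for a small number of scenes.
    """
    if len(edit_shifts) != len(control_shifts) or not edit_shifts:
        raise ValueError("need equally many, at least one, paired shifts")
    diffs = [e - c for e, c in zip(edit_shifts, control_shifts)]
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = sum(sign * d for sign, d in zip(signs, diffs))
        if permuted >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Made-up shifts for four scenes; every edit shift exceeds its control.
p = sign_flip_p_value([2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0])
```

With larger scene counts the full enumeration would be replaced by random sampling of sign assignments.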

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our paper. We address each major comment below and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.

    Authors: We acknowledge the referee's concern regarding the need for validation of the CoT intervention protocol to ensure it isolates causal effects. The manuscript describes the protocol but lacks explicit validation experiments and detailed controls. In the revised version, we will expand the methods section to include descriptions of token-editing mechanics, control conditions, and architecture-specific considerations. We will also add a discussion of potential confounds from architecture and training data differences. This addresses the point by providing more transparency, though full empirical validation may require additional experiments beyond the current scope.
    Revision: partial

  2. Referee: [Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.

    Authors: We agree that the absence of detailed metric definitions and implementation details in the abstract and results sections hinders evaluation. The revised manuscript will incorporate precise definitions for the observational metrics (mentioning, hallucination, contradiction, action alignment), specify the statistical tests used, and provide implementation details for the intervention outcomes. These changes will be made in the methods and results sections, with a brief mention in the abstract.
    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark without derivation chain

full rationale

The paper introduces VLADriveBench as an empirical evaluation framework combining observational metrics and a CoT intervention protocol. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided abstract or description. The central claims rest on applying the benchmark to existing models rather than any self-referential construction or reduction of results to inputs by definition. This matches the default expectation for non-derivational empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5677 in / 1174 out tokens · 26803 ms · 2026-06-27T09:34:57.091657+00:00 · methodology

