pith. sign in

arxiv: 2606.07338 · v1 · pith:K2FMIKGYnew · submitted 2026-06-05 · 💻 cs.CV

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

Pith reviewed 2026-06-27 22:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language planningcounterfactual supervisionverifiable reasoningdriving modelsnuScenescost-efficient data generationOmni-Q protocolselective correction
0
0 comments X

The pith

VeriDrive converts driving rationales into a verifiable Perception-Evaluation-Revision chain that improves planning metrics at lower annotation cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VeriDrive to replace expensive free-form rationales with structured, auditable supervision for vision-language driving models. It builds a Perception-Evaluation-Revision process that grounds objects, checks alternative trajectories against rules, and revises plans toward expert behavior, then scales the work by generating locally and correcting only flagged samples with a validator. On the nuScenes dataset under the Omni-Q protocol, the resulting data yields better L2, collision, and intersection scores than prior baselines while lowering token counts, generation time, and paid model expenses. The central demonstration is that auditable intermediate fields and targeted revision steps can deliver higher-quality supervision within realistic budgets.

Core claim

VeriDrive constructs planning-oriented, verifiable counterfactual supervision by converting driving reasoning into a Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets; local generation combined with validator-guided selective correction scales the process so that only invalid or difficult samples incur full model cost, resulting in improved open-loop planning metrics over OmniDrive at reduced logged token usage, generation time, and actual paid LLM/VLM cost.

What carries the argument

Perception-Evaluation-Revision chain: a structured sequence that grounds objects, applies rule-checkable trajectory evaluation, and produces revision targets to create auditable planning supervision.

If this is right

  • Auditable intermediate fields allow direct inspection of why a planning target was chosen, reducing reliance on opaque free-form rationales.
  • Selective correction limits expensive frontier-model calls to a minority of samples while preserving overall data quality.
  • The resulting dataset trains models that outperform OmniDrive baselines on nuScenes under identical Omni-Q training conditions.
  • Rule-checkable evidence in the evaluation step makes counterfactual trajectory comparisons reproducible without additional human review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chain structure could be adapted to other sequential decision domains where rule-based checks on future states are feasible.
  • Releasing the validator scripts would let independent groups audit or extend the generated dataset without re-running the full pipeline.
  • Lower per-sample cost might enable repeated iterations of dataset construction as base models improve, creating a feedback loop for supervision quality.

Load-bearing premise

The validator-guided selective correction accurately flags invalid or difficult samples without introducing selection bias that affects the final performance gains or cost savings.

What would settle it

Run the same training pipeline but replace validator selection with uniform random sampling of samples for full correction; if the L2, collision, and intersection gains disappear or total paid cost rises above the reported savings, the selective-correction benefit is not supported.

Figures

Figures reproduced from arXiv: 2606.07338 by Hubert P. H. Shum, Toby P. Breckon, Zikai Zhang.

Figure 1
Figure 1. Figure 1: Structured QA example from VeriDrive, linking scene evidence to a verifiable rea [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Efficiency–performance trade-off on nuScenes open-loop planning. Panels plot [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of VeriDrive. Given multi-view images, map/BEV cues, candidate ego [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative planning comparison. In each row, the left panels show the multi [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional qualitative visualization from VeriDrive. [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional qualitative visualization from VeriDrive. [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Additional qualitative visualization from VeriDrive. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
read the original abstract

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents VeriDrive, a framework for constructing planning-oriented verifiable counterfactual supervision for vision-language driving models. It converts reasoning into a structured Perception-Evaluation-Revision chain that grounds objects in future motion, evaluates alternative trajectories with rule-checkable evidence, and revises toward expert behavior. The approach combines local generation with validator-guided selective correction (escalating only invalid or difficult samples) to build a dataset on nuScenes, trains under the Omni-Q protocol, and reports improvements in L2, Collision, and Intersection metrics over OmniDrive alongside reductions in token usage, generation time, and paid LLM/VLM costs.

Significance. If the empirical claims hold under rigorous controls, VeriDrive could provide a scalable, lower-cost method for generating auditable supervision data for vision-language planning in autonomous driving, addressing the expense of frontier-model rationales. The planned release of code, prompts, and validator scripts would be a positive contribution to reproducibility.

major comments (2)
  1. [Abstract] Abstract: The central claim that VeriDrive yields better supervision and cost savings rests on the validator-guided selective correction producing a representative dataset without selection bias. The abstract provides no validator accuracy metrics, inter-annotator agreement on escalation decisions, or ablations isolating selective vs. exhaustive correction; without these, reported gains in L2/Collision/Intersection and cost metrics cannot be attributed to the Perception-Evaluation-Revision structure rather than filtering artifacts.
  2. [Abstract] Abstract: The abstract asserts metric improvements and cost reductions from 'controlled open-loop experiments' but supplies no details on experimental controls, statistical significance testing, error bars, dataset split construction, or baseline implementation specifics for the OmniDrive comparison; these omissions make the soundness of the performance claims impossible to evaluate from the provided text.
minor comments (1)
  1. The statement that code, prompts, and validator scripts 'are coming soon and will be released after the review process' should specify a concrete timeline or repository to support the reproducibility claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract to improve self-containment while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that VeriDrive yields better supervision and cost savings rests on the validator-guided selective correction producing a representative dataset without selection bias. The abstract provides no validator accuracy metrics, inter-annotator agreement on escalation decisions, or ablations isolating selective vs. exhaustive correction; without these, reported gains in L2/Collision/Intersection and cost metrics cannot be attributed to the Perception-Evaluation-Revision structure rather than filtering artifacts.

    Authors: The full manuscript reports validator accuracy, inter-annotator agreement, and the selective-vs-exhaustive ablation in Sections 3.2 and 5.2; these results indicate that performance gains are attributable to the structured chain rather than filtering. To make this evident from the abstract alone, we will revise the abstract to include a concise clause referencing validator performance and the ablation outcome. revision: yes

  2. Referee: [Abstract] Abstract: The abstract asserts metric improvements and cost reductions from 'controlled open-loop experiments' but supplies no details on experimental controls, statistical significance testing, error bars, dataset split construction, or baseline implementation specifics for the OmniDrive comparison; these omissions make the soundness of the performance claims impossible to evaluate from the provided text.

    Authors: The manuscript body (Section 4) details the open-loop protocol, nuScenes splits, OmniDrive baseline following the Omni-Q protocol, error bars, and significance testing. We will revise the abstract to briefly note these controls (e.g., 'with error bars and significance testing on nuScenes splits') so that the claims can be evaluated from the abstract text. revision: yes

Circularity Check

0 steps flagged

No derivations or self-referential steps; purely empirical framework

full rationale

The paper introduces VeriDrive as a structured Perception-Evaluation-Revision chain for counterfactual supervision, combined with validator-guided selective correction to build a dataset on nuScenes. It then reports controlled open-loop experiments comparing L2/Collision/Intersection metrics and token/cost reductions against OmniDrive under the Omni-Q protocol. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text. All claims rest on direct empirical comparisons rather than any derivation that reduces to its own inputs by construction. This matches the reader's assessment of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the domain assumption that driving trajectories admit rule-checkable evaluations; it introduces one new structured process without fitted parameters or entities having external falsifiable evidence.

axioms (1)
  • domain assumption Alternative ego trajectories in driving scenes can be evaluated using rule-checkable evidence.
    Invoked in the Evaluation step of the Perception-Evaluation-Revision chain.
invented entities (1)
  • Perception-Evaluation-Revision chain no independent evidence
    purpose: Structures driving reasoning into verifiable steps for supervision.
    New construct introduced by the framework.

pith-pipeline@v0.9.1-grok · 5727 in / 1200 out tokens · 27639 ms · 2026-06-27T22:18:24.368803+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A mul- timodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020. 16ZHANG, SHUM, BRECKON: VERIDRIVE

  3. [3]

    SOLVE: Synergy of language-vision and end-to-end networks for au- tonomous driving

    Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, and Hong- sheng Li. SOLVE: Synergy of language-vision and end-to-end networks for au- tonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12068–12077, 2025

  4. [4]

    Sparseoccvla: Bridging occupancy and vision-language models via sparse queries for unified 4d scene understanding and planning.arXiv preprint arXiv:2601.06474, 2026

    Chenxu Dang, Jie Wang, Guang Li, Zhiwen Hou, Zihan You, Hangjun Ye, Jie Ma, Long Chen, and Yan Wang. Sparseoccvla: Bridging occupancy and vision-language models via sparse queries for unified 4d scene understanding and planning.arXiv preprint arXiv:2601.06474, 2026

  5. [5]

    Talk2car: Taking control of your self-driving car

    Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie- Francine Moens. Talk2car: Taking control of your self-driving car. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 2088–2098, 2019. doi: 10...

  6. [6]

    EV A-02: A visual representation for neon genesis.Image and Vision Computing, 149: 105171, 2024

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EV A-02: A visual representation for neon genesis.Image and Vision Computing, 149: 105171, 2024. doi: 10.1016/j.imavis.2024.105171

  7. [7]

    ST-P3: End-to-end vision-based autonomous driving via spatial-temporal fea- ture learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal fea- ture learning. InComputer Vision – ECCV 2022, pages 533–549, 2022. doi: 10.1007/978-3-031-19839-7_31

  8. [8]

    UniAD: Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. UniAD: Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023

  9. [9]

    Robotron-drive: All-in-one large multimodal model for autonomous driving

    Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xi- aodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8011–8021, 2025

  10. [10]

    V AD: Vectorized scene repre- sentation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. V AD: Vectorized scene repre- sentation for efficient autonomous driving. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 8340–8350, 2023

  11. [11]

    Textual explanations for self-driving vehicles

    Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. InProceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018

  12. [12]

    Enhancing end-to-end autonomous driving with latent world model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. In International Conference on Learning Representations (ICLR), 2025

  13. [13]

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings ZHANG, SHUM, BRECKON: VERIDRIVE17 of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14864–14873, 2024

  14. [14]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024

  15. [15]

    DRAMA: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. DRAMA: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1043–1052, 2023

  16. [16]

    Lingoqa: Visual question answering for au- tonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Pra- jwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shot- ton, Elahe Arani, and Oleg Sinavski. Lingoqa: Visual question answering for au- tonomous driving. InComputer Vision – ECCV 2024, pages 252–269, 2024. doi: 10.1007/978-3-031-72980-5_15

  17. [17]

    Reason2drive: Towards interpretable and chain-based reasoning for au- tonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for au- tonomous driving. InComputer Vision – ECCV 2024, pages 292–308, 2024. doi: 10.1007/978-3-031-73347-5_17

  18. [18]

    Nuscenes- qa: A multi-modal visual question answering benchmark for autonomous driving sce- nario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes- qa: A multi-modal visual question answering benchmark for autonomous driving sce- nario. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol- ume 38, pages 4542–4550, 2024. doi: 10.1609/aaai.v38i5.28253

  19. [19]

    Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning

    Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7699–7707, 2018

  20. [20]

    Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning

    Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Mykel Kochenderfer, Chiho Choi, and Behzad Dariush. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7513–7522, 2024

  21. [21]

    DriveLM: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. InComputer Vision – ECCV 2024, pages 256–274, 2024

  22. [22]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InProceedings of The 8th Conference on Robot Learning, pages 4698–4726, 2025

  23. [23]

    OpenLane-V2: A topology reasoning benchmark for unified 3d hd mapping

    Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, Feng Wen, Hang Xu, Ping Luo, Junchi 18ZHANG, SHUM, BRECKON: VERIDRIVE Yan, Wei Zhang, and Hongyang Li. OpenLane-V2: A topology reasoning benchmark for unified 3d hd mapping. InAdvances in Neural Information Processing Systems (NeurIP...

  24. [24]

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22442–22452, June 2025

  25. [25]

    Language prompt for autonomous driving

    Dongming Wu, Wencheng Han, Yingfei Liu, Tiancai Wang, Cheng-Zhong Xu, Xi- angyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. InPro- ceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8359– 8367, 2025. doi: 10.1609/aaai.v39i8.32902

  26. [26]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Yiran Xu, Xiaoyin Yang, Lihang Gong, Hsuan-Chu Lin, Tz-Ying Wu, Yunsheng Li, and Nuno Vasconcelos. Explainable object-induced action decision for autonomous vehicles. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9523–9532, 2020. doi: 10.1109/CVPR42600.2020.00954

  27. [27]

    AutoDrive-P3: Unified chain of perception–prediction–planning thought via reinforce- ment fine-tuning

    Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng, and Wei Gao. AutoDrive-P3: Unified chain of perception–prediction–planning thought via reinforce- ment fine-tuning. InInternational Conference on Learning Representations (ICLR),

  28. [28]

    Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes

    Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes.arXiv preprint arXiv:2305.10430, 2023. doi: 10.48550/arXiv.2305.10430

  29. [29]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, XianPeng Lang, and Dongbin Zhao. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 28632–28642, 2025

  30. [30]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language ac- tion model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language ac- tion model. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 40, pages 13782–13790, 2026. doi: 10.1609/aaai.v40i16.38386. ZHANG, SHUM, BRECKON: VERIDRIVE19 A Suppl...