pith. machine review for the scientific record.

arxiv: 2603.18561 · v2 · submitted 2026-03-19 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:05 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords causal intervention · end-to-end autonomous driving · de-confounding · planning accuracy · nuScenes · backdoor adjustment · sparse vectorized queries

The pith

CausalVAD de-confounds end-to-end driving models by intervening on vectorized queries with a prototype dictionary of driving contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Planning-oriented end-to-end autonomous driving models learn statistical correlations from training data rather than true causal relationships, which causes them to exploit biases as shortcuts and fail in complex or biased scenarios. CausalVAD introduces a training framework built around the sparse causal intervention scheme, which constructs a dictionary of prototypes to represent latent driving contexts and then intervenes on the model's sparse vectorized queries. This step applies backdoor adjustment to remove spurious associations from the learned representations. A sympathetic reader would care because reliable planning in safety-critical driving requires distinguishing real causes from dataset artifacts. If the approach holds, it would produce models that maintain accuracy and safety even when data biases or noise are present.

Core claim

The paper claims that the sparse causal intervention scheme (SCIS) instantiates backdoor adjustment in neural networks by building a dictionary of prototypes for latent driving contexts and using those prototypes to intervene directly on the sparse vectorized queries, thereby eliminating spurious factors induced by confounders and producing cleaner representations for downstream planning tasks.

What carries the argument

The sparse causal intervention scheme (SCIS), a lightweight plug-and-play module that builds a prototype dictionary of driving contexts and intervenes on sparse vectorized queries to perform backdoor adjustment.
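The mechanism as described can be caricatured in a few lines: build a small prototype dictionary, estimate a prior over prototypes, and replace each query with its prototype-mixed expectation. The sketch below is illustrative only; the array names, the additive fusion, and the similarity-based prior are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N sparse vectorized queries of dimension D,
# and a dictionary of K context prototypes.
N, D, K = 4, 8, 3
queries = rng.normal(size=(N, D))
prototypes = rng.normal(size=(K, D))

def intervene(queries, prototypes):
    """Backdoor-style intervention: mix each query with every prototype,
    weighted by a prototype prior P(p) -- here a softmax over the
    per-prototype mean query affinity (an invented stand-in)."""
    sim = queries @ prototypes.T                # (N, K) affinities
    prior = np.exp(sim.mean(axis=0))
    prior /= prior.sum()                        # P(p), sums to 1
    # E_p[f(query, p)] with a simple additive fusion f
    mixed = queries[:, None, :] + prototypes[None, :, :]    # (N, K, D)
    return np.tensordot(mixed, prior, axes=([1], [0]))      # (N, D)

deconfounded = intervene(queries, prototypes)
print(deconfounded.shape)  # (4, 8)
```

The essential property is that the output no longer depends on which latent context a particular sample happened to come from, only on the dictionary-weighted average over contexts.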

If this is right

  • Models using SCIS reach state-of-the-art planning accuracy and safety scores on the nuScenes benchmark.
  • The framework shows improved robustness when data biases or noisy inputs are introduced to trigger causal confusion.
  • SCIS integrates as a lightweight module into existing end-to-end architectures without requiring architectural redesign.
  • Representations produced after intervention contain fewer spurious associations for any downstream driving task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prototype-based intervention pattern could be tested on other perception modules such as object detection or lane segmentation where dataset biases also create shortcuts.
  • One could measure how the number and diversity of prototypes trade off bias removal against retention of useful causal signals across different driving domains.
  • Integration with online adaptation methods might allow the prototype dictionary to update during deployment when new contexts appear.
  • The approach raises the question of whether explicit causal graphs of driving variables could further strengthen the intervention beyond the current dictionary method.

Load-bearing premise

That constructing a dictionary of prototypes from the data and intervening on sparse vectorized queries correctly implements backdoor adjustment and removes all relevant spurious associations without discarding causally relevant information.
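Whether intervention actually de-biases is checkable on a toy system where the ground truth is known. The sketch below invents a binary confounder Z, a treatment X, and a deterministic outcome Y = X + 2Z, then compares the naive conditional E[Y|X] against the backdoor-adjusted E[Y|do(X)]; all numbers are illustrative.

```python
# Toy check that the backdoor formula P(Y|do(X)) = sum_z P(Y|X,z) P(z)
# removes confounding. The distributions are invented for illustration.
p_z = {0: 0.5, 1: 0.5}            # P(Z): confounder prior
p_x_given_z = {0: 0.2, 1: 0.8}    # P(X=1 | Z=z): Z influences X

def y(x, z):                      # deterministic outcome: Y = X + 2Z
    return x + 2 * z

def e_y_given_x(x):
    """Naive conditional expectation E[Y | X=x] (confounded)."""
    px = lambda z: p_x_given_z[z] if x == 1 else 1 - p_x_given_z[z]
    norm = sum(px(z) * p_z[z] for z in p_z)
    return sum(y(x, z) * px(z) * p_z[z] / norm for z in p_z)

def e_y_do_x(x):
    """Backdoor-adjusted E[Y | do(X=x)] = sum_z E[Y | X=x, Z=z] P(z)."""
    return sum(y(x, z) * p_z[z] for z in p_z)

naive = e_y_given_x(1) - e_y_given_x(0)   # ~2.2: inflated by Z
causal = e_y_do_x(1) - e_y_do_x(0)        # 1.0: the true effect of X
print(naive, causal)
```

The premise at issue is exactly the gap between these two estimators: the paper asserts the prototype intervention lands on the second, not the first, without discarding causal signal along the way.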

What would settle it

Performance comparison on test sets that explicitly introduce new confounders absent from training data, such as controlled shifts in traffic density or weather patterns designed to break the original statistical shortcuts, to check whether CausalVAD still outperforms non-causal baselines.
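A minimal version of that probe can be written directly: fit a naive predictor where a spurious context variable tracks the label during training, then flip the correlation at test time. The data generator and thresholds below are invented for illustration; a real evaluation would use nuScenes-scale scenarios.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_split(n, p_align):
    """Toy driving data: binary label y (e.g. brake/go) and a spurious
    context s that agrees with y with probability p_align. The causal
    feature is a noisy copy of y; the spurious feature is a cleaner
    copy of s, inviting shortcut learning."""
    y = rng.integers(0, 2, n)
    s = np.where(rng.random(n) < p_align, y, 1 - y)
    x = np.stack([y + 0.5 * rng.normal(size=n),      # causal, noisy
                  s + 0.1 * rng.normal(size=n)], 1)  # spurious, clean
    return x, y

x_tr, y_tr = make_split(5000, p_align=0.95)   # confounder aligned
x_te, y_te = make_split(5000, p_align=0.05)   # confounder flipped

# Least-squares linear probe with intercept (no de-confounding)
w, *_ = np.linalg.lstsq(np.c_[x_tr, np.ones(len(x_tr))], y_tr, rcond=None)
acc = lambda x, y: (((np.c_[x, np.ones(len(x))] @ w) > 0.5) == y).mean()
print(acc(x_tr, y_tr), acc(x_te, y_te))  # train high; shifted test collapses
```

A de-confounded model should hold its accuracy across the flip; the naive probe here does not, which is the signature the proposed evaluation would look for.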

Figures

Figures reproduced from arXiv: 2603.18561 by Jiacheng Tang, Jian Pu, Jia Zhang, Kai Zhang, Zhiyuan Zhou, Zhuolin He.

Figure 1: The problem of spurious correlation.
Figure 2: The overall architecture of CausalVAD. Our method performs precise, multi-stage causal interventions at critical information hubs.
Figure 3: The structural causal model (SCM) of VAD.
Figure 4: The backdoor adjustment [25] principle. A confounder Z opens a spurious backdoor path S ← Z → Y. Applying the do-operator, i.e., P(Y|do(S)), severs this path, isolating the pure causal effect S → Y.
Figure 5: T-SNE visualization of the final ego-query embeddings.
Figure 6: Qualitative analysis of CausalVAD's interpretability and decision logic in a challenging cut-in scenario.
Figure 7: Qualitative visualization of IDM in a cut-in scenario.
Original abstract

Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CausalVAD, a de-confounding framework for planning-oriented end-to-end autonomous driving models. It introduces the sparse causal intervention scheme (SCIS) that constructs a dictionary of prototypes from data to represent latent contexts and intervenes on sparse vectorized queries to implement backdoor adjustment, thereby removing spurious associations. The paper reports that this yields state-of-the-art planning accuracy and safety on nuScenes while providing superior robustness to data bias and to noisy scenarios designed to induce causal confusion.

Significance. If SCIS can be shown to correctly instantiate backdoor adjustment without residual bias or loss of causal signal, the framework would address a fundamental limitation of correlation-based driving models and improve reliability in safety-critical settings.

major comments (3)
  1. [SCIS module description] No explicit causal graph is supplied, and no derivation or identifiability argument is given showing that prototype-based intervention on sparse queries equals the backdoor adjustment formula P(Y|do(X)) = ∑_z P(Y|X,z)P(z).
  2. [Experiments section] The central SOTA and robustness claims are asserted without reported quantitative baseline numbers, ablation results on the free parameter 'prototype dictionary size', metrics quantifying residual causal confusion, or statistical significance tests.
  3. [Robustness evaluation] Because prototypes are extracted from the same observational distribution that contains the confounders, the manuscript provides no analysis demonstrating that the intervention step removes all relevant spurious paths without discarding causally relevant information.
minor comments (1)
  1. [Abstract] The abstract states that 'extensive experiments' were performed yet supplies no concrete metric values or baseline names.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful comments on our manuscript. We address each major comment below and have revised the paper to incorporate the suggested improvements where possible.

Point-by-point responses
  1. Referee: [SCIS module description] No explicit causal graph is supplied, and no derivation or identifiability argument is given showing that prototype-based intervention on sparse queries equals the backdoor adjustment formula P(Y|do(X)) = ∑_z P(Y|X,z)P(z).

    Authors: We agree with this observation. The revised manuscript includes an explicit causal graph in Figure 2, illustrating the relationships between inputs X, confounders Z, and planning output Y. Additionally, Section 3.3 now provides a detailed derivation showing that the sparse causal intervention on vectorized queries approximates the backdoor adjustment by summing over the prototype distribution: P(Y|do(X)) ≈ ∑_p P(Y|X,p) P(p), where p denotes the learned prototypes. We discuss the identifiability assumptions, including that the prototype dictionary sufficiently captures the latent contexts. revision: yes

  2. Referee: [Experiments section] The central SOTA and robustness claims are asserted without reported quantitative baseline numbers, ablation results on the free parameter 'prototype dictionary size', metrics quantifying residual causal confusion, or statistical significance tests.

    Authors: We acknowledge the lack of detailed quantitative support in the original submission. The revised experiments section now reports full baseline numbers in Table 1 for comparison with prior methods on nuScenes planning metrics. We include an ablation study on prototype dictionary size (Table 3, sizes ranging from 10 to 200), new metrics for residual causal confusion (e.g., correlation coefficients between intervened representations and known bias factors), and statistical significance via repeated trials with t-tests (p-values reported). revision: yes

  3. Referee: [Robustness evaluation] Because prototypes are extracted from the same observational distribution that contains the confounders, the manuscript provides no analysis demonstrating that the intervention step removes all relevant spurious paths without discarding causally relevant information.

    Authors: This is a valid concern. In the revision, we have added theoretical analysis in Section 4.2 using do-calculus to argue that the sparse intervention blocks spurious paths from confounders while preserving causal paths through the prototypes. We also provide empirical results on synthetically biased nuScenes subsets, showing reduced sensitivity to confounders. However, a definitive demonstration that no causally relevant information is lost would require access to the ground-truth causal graph, which is not available; we have added this as a limitation in the discussion. revision: partial

standing simulated objections not resolved
  • Complete empirical verification that the intervention removes all spurious paths without any loss of causal signal, due to the absence of ground-truth causal structures in real-world driving datasets like nuScenes.

Circularity Check

0 steps flagged

SCIS applies an externally specified backdoor adjustment via data-derived prototypes; by construction it does not reduce the target metric to its own fitted inputs.

full rationale

The paper's core step constructs a prototype dictionary from observational data and intervenes on sparse queries to instantiate backdoor adjustment. This follows the standard causal formula P(Y|do(X)) = ∑ P(Y|X,z)P(z) rather than re-deriving the planning accuracy or robustness metric from the same fitted prototypes. No equation equates the final performance claim to the input statistics by definition, and no self-citation chain or uniqueness theorem is invoked to force the result. The nuScenes experiments and bias/noise augmentations supply independent empirical checks. Minor risk exists that prototypes may incompletely cover confounders, but this is a coverage issue, not a circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the assumption that latent driving contexts can be captured by a finite set of prototypes and that intervening on sparse queries removes all spurious correlations induced by confounders.

free parameters (1)
  • prototype dictionary size
    Number of context prototypes must be chosen; affects how finely latent confounders are represented.
axioms (1)
  • domain assumption: The backdoor adjustment formula can be realized by intervening on sparse vectorized queries inside a neural network.
    Invoked to justify the SCIS module design.
invented entities (1)
  • Sparse causal intervention scheme (SCIS): no independent evidence
    purpose: Lightweight module that performs causal intervention on model queries using a prototype dictionary
    New component introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5478 in / 1263 out tokens · 38841 ms · 2026-05-15T09:05:55.572158+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The DAWN of World-Action Interactive Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

  2. DINO-VO: Learning Where to Focus for Enhanced State Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    DINO-VO achieves state-of-the-art monocular visual odometry accuracy and generalization by training a differentiable patch selector together with multi-task features and inverse-depth bundle adjustment.

  3. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 3 Pith papers · 4 internal anchors

  1. [1] Shahin Atakishiyev, Mohammad Salameh, and Randy Goebel. Safety implications of explainable artificial intelligence in end-to-end autonomous driving. IEEE Trans. Intell. Transp. Syst., 2025.

  2. [2] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.

  3. [3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 11621–11631.

  4. [4] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conf. Robot. Learn., pages 66–75. PMLR, 2020.

  5. [5] Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. PPAD: Iterative interactions of prediction and planning for end-to-end autonomous driving. In Eur. Conf. Comput. Vis., pages 239–256. Springer, 2024.

  6. [6] Jie Cheng, Yingbing Chen, Xiaodong Mei, Bowen Yang, Bo Li, and Ming Liu. Rethinking imitation-based planners for autonomous driving. In 2024 IEEE Int. Conf. Robot. Autom., pages 14123–14130. IEEE, 2024.

  7. [7] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.

  8. [8] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Adv. Neural Inform. Process. Syst., 37:28706–28719, 2024.

  9. [9] Pim De Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. Adv. Neural Inform. Process. Syst., 32, 2019.

  10. [10] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 24823–24834, 2025.

  11. [11] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Mach. Intell., 2(11):665–673, 2020.

  12. [12] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In Eur. Conf. Comput. Vis., pages 533–549. Springer, 2022.

  13. [13] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 17853–17862, 2023.

  14. [14] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262.

  15. [15] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Adv. Neural Inform. Process. Syst., 37:819–844, 2024.

  16. [16] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 8340–8350, 2023.

  17. [17] Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024.

  18. [18] Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 4524–4536, 2025.

  19. [19] Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, and Jiangmiao Pang. ArtDeco: Toward high-fidelity on-the-fly reconstruction with hierarchical gaussian structure and feed-forward guidance. In Int. Conf. Learn. Represent.

  20. [20] Guanghao Li, Yu Cao, Qi Chen, Xin Gao, Yifan Yang, and Jian Pu. PAPL-SLAM: Principal axis-anchored monocular point-line SLAM. IEEE Robot. Autom. Letters, 2025.

  21. [21] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 14864–14873, 2024.

  22. [22] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 12037–12047.

  23. [23] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.

  24. [24] Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann Cun. Off-road obstacle avoidance through end-to-end learning. Adv. Neural Inform. Process. Syst., 18, 2005.

  25. [25] Judea Pearl. Causality. Cambridge University Press, 2009.

  26. [26] Mozhgan Pourkeshavarz, Junrui Zhang, and Amir Rasouli. CaDeT: A causal disentanglement approach for robust trajectory prediction in autonomous driving. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 14874–14884.

  27. [27] Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE Int. Conf. Robot. Autom., pages 8795–8801. IEEE.

  28. [28] Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, and Jian Pu. Decoupling scene perception and ego status: A multi-context fusion approach for enhanced generalization in end-to-end autonomous driving. arXiv preprint arXiv:2511.13079, 2025.

  29. [29] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.

  30. [30] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 22442–22452, 2025.

  31. [31] Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. Visual commonsense R-CNN. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 10760–10770, 2020.

  32. [32] Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are VLMs ready for autonomous driving? An empirical study from the reliability, data and metric perspectives. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 6585–6597.

  33. [33] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Int. Conf. Mach. Learn., pages 2048–2057. PMLR, 2015.

  34. [34] Zhenhua Xu, Yan Bai, Yujia Zhang, Zhuoling Li, Fei Xia, Kwan-Yee K Wong, Jianqiang Wang, and Hengshuang Zhao. DriveGPT4-V2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 17261–17270, 2025.

  35. [35] Yuxiang Yan, Boda Liu, Jianfei Ai, Qinbu Li, Ru Wan, and Jian Pu. PointSSC: A cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. In IEEE Int. Conf. Robot. Autom., pages 17027–17034. IEEE.

  36. [36] Yuxiang Yan, Zhiyuan Zhou, Xin Gao, Guanghao Li, Shenglin Li, Jiaqi Chen, Qunyan Pu, and Jian Pu. Learning spatial-aware manipulation ordering. In Adv. Neural Inform. Process. Syst., 2025.

  37. [37] Dingkang Yang, Kun Yang, Haopeng Kuang, Zhaoyu Chen, Yuzheng Wang, and Lihua Zhang. Towards context-aware emotion recognition debiasing from a causal demystification perspective via de-confounded training. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):10663–10680, 2024.

  38. [38] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278, 2025.