
arxiv: 2505.15925 · v4 · submitted 2025-05-21 · 💻 cs.RO · cs.AI · cs.CV

VERDI: VLM-Embedded Reasoning for Autonomous Driving

Pith reviewed 2026-05-22 13:24 UTC · model grok-4.3

keywords autonomous driving · vision-language models · knowledge distillation · end-to-end planning · latent space alignment · commonsense reasoning

The pith

VERDI aligns intermediate outputs from perception, prediction and planning modules with VLM text features at training time so the driving stack absorbs commonsense reasoning without paying VLM inference costs at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autonomous driving stacks often fail at decisions under partial observability while large VLMs can supply commonsense but are too slow and memory-heavy for deployment. VERDI solves this by distilling VLM reasoning into existing modular end-to-end AD models through a latent-space alignment loss applied at the three main stages. The alignment encourages each module to produce outputs whose features match explanations the VLM would generate for the same scene, letting the compact AD network internalize the reasoning structure. Experiments report up to 11 percent lower L2 trajectory error in open-loop tests and a 10 percent gain in non-collision rate inside the closed-loop HugSim simulator, all while retaining the fast inference speed of the original modular stack.

Core claim

VERDI augments modular differentiable end-to-end AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs, enabling the modular AD stack to internalize structured reasoning without incurring the inference-time costs of large VLMs.

What carries the argument

Latent-space alignment loss that matches AD module outputs at perception, prediction and planning stages to VLM-generated text features describing driving reasoning.
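
A minimal sketch of what such an alignment loss could look like, assuming a cosine-distance objective and a linear projection into the shared space. This page does not state the paper's exact loss, so `project`, `cosine_similarity`, `alignment_loss`, the identity weights, and the toy vectors are all illustrative assumptions, not the authors' implementation.

```python
import math

def project(features, weights):
    """Linear map from a module's feature vector into the shared space.
    Stands in for a trainable projection layer (assumption)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(module_features, text_features, weights):
    """1 - cosine similarity: zero when the projected module feature
    points in the same direction as the text feature."""
    projected = project(module_features, weights)
    return 1.0 - cosine_similarity(projected, text_features)

def total_alignment_loss(stages):
    """Sum of per-stage losses over perception, prediction, planning."""
    return sum(alignment_loss(f, t, w) for f, t, w in stages.values())

# Hypothetical toy inputs: (module feature, VLM text feature, projection).
identity = [[1.0, 0.0], [0.0, 1.0]]
stages = {
    "perception": ([1.0, 0.0], [1.0, 0.0], identity),  # same direction
    "prediction": ([1.0, 0.0], [0.0, 1.0], identity),  # orthogonal
    "planning":   ([1.0, 1.0], [2.0, 2.0], identity),  # parallel, different scale
}

print(total_alignment_loss(stages))  # close to 1.0: only prediction is misaligned
```

In a real training loop this term would be added to the usual driving losses, and the text features would be computed once offline, which is why no VLM is needed at test time.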

If this is right

  • The aligned models achieve up to 11 percent lower L2 distance than prior end-to-end methods without embedded reasoning.
  • Closed-loop driving in the HugSim simulator reaches the highest overall score with a 10 percent gain in non-collision rate.
  • Inference speed remains fast because no VLM runs at test time.
  • Modular structure is preserved, supporting safety decomposition that monolithic VLM planners lack.
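
To make the two headline metrics concrete, here is a small sketch of how an open-loop L2 error, a relative reduction, and a non-collision rate can be computed. The function names and every number below are hypothetical; they are chosen so the relative reduction comes out near 11 percent and are not taken from the paper's tables. This page also does not say whether the reported 10 percent non-collision gain is relative or absolute.

```python
import math

def mean_l2_error(planned, ground_truth):
    """Average Euclidean distance between matching 2D waypoints."""
    distances = [math.dist(p, g) for p, g in zip(planned, ground_truth)]
    return sum(distances) / len(distances)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

def non_collision_rate(collided_flags):
    """Share of episodes that finished without a collision."""
    return 1.0 - sum(collided_flags) / len(collided_flags)

# Hypothetical trajectories (metres), three waypoints each.
ground_truth  = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
baseline_plan = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
aligned_plan  = [(0.0, 0.89), (1.0, 0.89), (2.0, 0.89)]

base_err = mean_l2_error(baseline_plan, ground_truth)
new_err = mean_l2_error(aligned_plan, ground_truth)
print(relative_reduction(base_err, new_err))  # roughly 0.11

# Hypothetical closed-loop outcomes: True means the episode had a collision.
print(non_collision_rate([False, True, False, False]))
```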

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-alignment idea could transfer commonsense from large models into other sequential control tasks that must run at real-time rates.
  • Measuring how much each stage (perception versus planning) benefits from the alignment would clarify where the reasoning transfer occurs most strongly.
  • Replacing the VLM text features with cheaper synthetic descriptions might test how much of the gain depends on the specific VLM chosen.

Load-bearing premise

That matching the latent representations of driving modules to VLM text features is enough to transfer useful commonsense reasoning into the driving policy.

What would settle it

A closed-loop test in which the VERDI-trained model shows no improvement or a drop in non-collision rate and trajectory accuracy relative to the identical baseline without any VLM alignment would falsify the claimed benefit.

Figures

Figures reproduced from arXiv: 2505.15925 by Anirudha Majumdar, Baiang Li, Bowen Feng, Felix Heide, Filippo Ghilotti, Julian Ost, Roger Girgis, Zhiting Mei.

Figure 1. Overview of VERDI. Our pipeline aligns the VLM reasoning module with our e2e driving model. During training, the ground truth (GT) trajectory and observed images are provided to the VLM for it to explain the reasoning throughout perception, prediction, and planning during the driving process. The VLM’s answers to each submodule are aligned with the corresponding submodule outputs from the e2e driving model,…
Figure 2. Obtaining description features through chain-of-thought prompting and text encoder. For each query, the prompt consists …
Figure 3. VERDI Training. The e2e model is trained with VERDI for the individual perception, prediction, and planning modules. All relevant feature maps F and Q are first mapped to a feature fP in a representation space, which is shared with the encoded language features fM. This mapping is facilitated by VERDI’s trainable PFP layers. The perception outputs Fperception including the extracted image features, are dir…
Figure 4. Qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) on the nuScenes dataset [13]. Each entry shows the multi-view camera observations on the left and the BEV view on the right at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image and BEV panel as a solid green line that fades to blue. The BEV p…
Figure 5. Example scenes with RC drops. We qualitatively show …
Figure 6. OpenEMMA Testing Example (Bus Scene) on the …
Figure 7. OpenEMMA Testing Example (City Scene) on the …
Figure 8. OpenEMMA Testing Example (Stop Sign Scene) on …
Figure 10. Additional qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) on the nuScenes dataset [13]. Each entry shows the multi-view camera observations on the left and the BEV view on the right at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image and BEV panel as a solid green line that fades to blu…
Figure 11. Qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) in the HugSim simulator [27] closed-loop test. Each entry shows the front view camera observations at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image as sequential yellow dots as waypoints. Each example shows successful performance on end t…
Original abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We evaluate VERDI in both open-loop and closed-loop settings. Our method outperforms existing end-to-end approaches without embedded reasoning by up to 11% in $\ell_{2}$ distance, and achieves the best overall driving performance in the closed-loop HugSim simulator, including a 10% improvement in Non-Collision Rate, while maintaining fast inference speed.
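
The deployment figures in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and a 100-token response; both are assumptions introduced here, and the helper names are hypothetical. Under those assumptions the weights alone take about 140 GB, which is consistent with the quoted "more than 160G" once runtime overhead such as caches and activations is added.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the model weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def response_latency_s(num_tokens, tokens_per_second):
    """Time to generate a response at a fixed decoding rate."""
    return num_tokens / tokens_per_second

# 70B parameters, assumed 16-bit precision (2 bytes each).
print(weight_memory_gb(70e9, 2))    # 140.0

# Assumed 100-token explanation at the 8 tokens/s rate quoted in the abstract.
print(response_latency_s(100, 8))   # 12.5
```

A latency of several seconds per response is far outside what a planner running at real-time rates can tolerate, which is the practical motivation for moving the VLM to training time only.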

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VERDI, a training-time framework that augments modular differentiable end-to-end autonomous driving models by aligning intermediate outputs at the perception, prediction, and planning stages with text features from VLMs that describe driving reasoning. This alignment is intended to distill commonsense knowledge and structured reasoning into the AD stack without incurring VLM inference costs at deployment. The method is evaluated in open- and closed-loop settings, reporting up to 11% improvement in L2 distance over existing end-to-end approaches and a 10% gain in non-collision rate in the HugSim simulator while preserving fast inference.

Significance. If the alignment mechanism demonstrably transfers functional reasoning rather than providing generic regularization, the approach would offer a practical route to embedding human-like decision-making in efficient, safety-decomposable AD stacks. The reported closed-loop gains on non-collision rate and trajectory accuracy would then represent a meaningful advance for handling partial observability without monolithic VLM deployment.

major comments (2)
  1. [Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.
  2. [Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.
minor comments (2)
  1. [Method] Clarify whether the VLM text features are high-level summaries or step-wise decision traces, as this affects the interpretation of what reasoning is being transferred.
  2. [Discussion] Add a brief discussion of potential failure modes when VLM features contain hallucinations or biases that could propagate through the alignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our contributions. We address each major comment below, providing the strongest honest defense of the manuscript while indicating planned revisions where appropriate.

  1. Referee: [Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.

    Authors: We agree that stronger evidence is needed to distinguish reasoning transfer from generic auxiliary supervision. The alignment objective specifically projects module outputs onto VLM text embeddings that encode explicit driving reasoning (e.g., descriptions of occluded agents or intent inference), rather than arbitrary features. The closed-loop gains, especially the 10% non-collision improvement in scenarios requiring partial-observability reasoning, provide indirect support that the internalized representations are functionally relevant. To directly address the concern, we will add an ablation replacing reasoning-specific VLM text with random or non-driving captions and demonstrate degraded performance, isolating the contribution of the structured reasoning content. revision: yes

  2. Referee: [Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.

    Authors: We concur that statistical rigor and precise experimental details are necessary to substantiate the reported improvements. In the revised manuscript we will include statistical significance tests (e.g., paired t-tests across multiple seeds with p-values) for both L2 distance and non-collision rate. We will also document the exact hyperparameter settings, training protocols, and baseline configurations used, together with additional controls such as fixed random seeds and sensitivity analysis to post-hoc choices. revision: yes
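
As a rough illustration of the paired test the authors promise, the sketch below computes a paired t-statistic across seeds using only the standard library. The per-seed L2 values are invented for illustration and `paired_t_statistic` is a hypothetical helper; turning the statistic into a p-value would additionally need the Student t distribution, which the standard library does not provide.

```python
import math
import statistics

def paired_t_statistic(baseline_scores, method_scores):
    """Paired t-statistic for per-seed differences (baseline minus method).

    Returns the statistic and its degrees of freedom (n - 1)."""
    diffs = [b - m for b, m in zip(baseline_scores, method_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)           # sample standard deviation
    standard_error = std_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1

# Hypothetical L2 errors (metres) for the same four seeds; lower is better.
baseline_l2 = [1.00, 1.10, 0.90, 1.05]
aligned_l2  = [0.90, 0.95, 0.85, 0.95]

t_stat, dof = paired_t_statistic(baseline_l2, aligned_l2)
print(t_stat, dof)  # a large positive t means the baseline error is consistently higher
```

Pairing by seed matters here: it removes seed-to-seed variance that both models share, so a small but consistent gain is not drowned out.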

Circularity Check

0 steps flagged

No significant circularity detected


The paper's derivation consists of a training-time latent alignment objective between modular AD outputs and externally generated VLM text features, followed by separate empirical evaluation of the resulting model in open-loop and closed-loop settings. The alignment serves as an auxiliary supervision signal rather than a self-referential definition of the reported metrics; L2 distance and non-collision rate are measured against independent baselines and simulators. No self-citations, uniqueness theorems, or fitted parameters that are later renamed as predictions appear in the abstract or method description. The central claim therefore remains an empirical hypothesis about transfer via alignment, not a reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that VLM text features contain transferable driving reasoning and that latent alignment can embed this reasoning into the AD modules. No explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption VLM-generated text features encode useful commonsense reasoning for driving decisions under partial observability.
    Invoked when the method aligns AD module outputs with these features to internalize reasoning.

pith-pipeline@v0.9.0 · 5829 in / 1276 out tokens · 26426 ms · 2026-05-22T13:24:49.534205+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  2. How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

    cs.CV 2026-04 conditional novelty 6.0

    VENUSS benchmark shows top VLMs achieve 57% accuracy on sequential driving scenes, strong on static objects but weak on vehicle dynamics and temporal relations.

  3. How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

    cs.CV 2026-04 unverdicted novelty 6.0

    VENUSS evaluates 25+ VLMs across 2600+ sequential driving scenarios and finds top models reach only 57% accuracy versus 65% for humans, with good static detection but poor performance on vehicle dynamics and temporal ...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 2 Pith papers · 8 internal anchors

  1. [1]

    Comparative safety performance of autonomous- and human drivers: A real-world case study of the Waymo driver,

    L. Di Lillo, T. Gode, X. Zhou, M. Atzei, R. Chen, and T. Victor, “Comparative safety performance of autonomous- and human drivers: A real-world case study of the Waymo driver,” Heliyon, vol. 10, no. 14, 2024

  2. [2]

    Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,

    M. Jung, J. Park, and M.-S. Pang, “Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,” 2024

  3. [3]

    Learning from all vehicles,

    D. Chen and P. Krähenbühl, “Learning from all vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17222–17231

  4. [4]

    Effective adaptation in multi-task co-training for unified autonomous driving,

    X. Liang, Y. Wu, J. Han, H. Xu, C. Xu, and X. Liang, “Effective adaptation in multi-task co-training for unified autonomous driving,” Advances in Neural Information Processing Systems, vol. 35, pp. 19645–19658, 2022

  5. [5]

    End-to-end interpretable neural motion planner,

    W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8660–8669

  6. [6]

    Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,

    Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022

  7. [7]

    Learning unsupervised world models for autonomous driving via discrete diffusion,

    L. Zhang, Y. Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun, “Learning unsupervised world models for autonomous driving via discrete diffusion,” arXiv preprint arXiv:2311.01017, 2023

  8. [8]

    Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,

    X. Zhang, X. Tan, Y. An, Y. Li, and Z. Fan, “Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,” Expert Systems with Applications, vol. 252, p. 124158, 2024

  9. [9]

    Planning-oriented autonomous driving,

    Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17853–17862

  10. [10]

    Vad: Vectorized scene representation for efficient autonomous driving,

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8340–8350

  11. [11]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    S. Chen, B. Jiang, H. Gao, B. Liao, Q. Xu, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Vadv2: End-to-end vectorized autonomous driving via probabilistic planning,” arXiv preprint arXiv:2402.13243, 2024

  12. [12]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez, “Is ego status all you need for open-loop end-to-end autonomous driving?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14864–14873

  13. [13]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631

  14. [14]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

  15. [16]

    Bounded rationality,

    H. A. Simon, “Bounded rationality,” Utility and probability, pp. 15–18, 1990

  16. [17]

    Bounded rationality,

    B. D. Jones, “Bounded rationality,” Annual review of political science, vol. 2, no. 1, pp. 297–321, 1999

  17. [18]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-End Multimodal Model for Autonomous Driving,” Oct. 2024, arXiv:2410.23262 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2410.23262

  18. [19]

    OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,

    S. Xing, C. Qian, Y. Wang, H. Hua, K. Tian, Y. Zhou, and Z. Tu, “OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,” Dec. 2024, arXiv:2412.15208 [cs]. [Online]. Available: http://arxiv.org/abs/2412.15208

  19. [20]

    Driving with llms: Fusing object-level vector modality for explainable autonomous driving,

    L. Chen, O. Sinavski, J. Hünermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with llms: Fusing object-level vector modality for explainable autonomous driving,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14093–14100

  20. [21]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model,

    Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K.-Y. K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,” IEEE Robotics and Automation Letters, 2024

  21. [22]

    Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,

    S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, and J. M. Alvarez, “Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,” arXiv preprint arXiv:2405.01533, 2024

  22. [23]

    Drivelm: Driving with graph visual question answering,

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274

  23. [24]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving

    W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y. Wen, S. Wu, H. Deng, Z. Li et al., “Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving,” arXiv preprint arXiv:2312.09245, 2023

  24. [25]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024

  25. [26]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  26. [27]

    Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,

    H. Zhou, L. Lin, J. Wang, Y. Lu, D. Bai, B. Liu, Y. Wang, A. Geiger, and Y. Liao, “Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,” arXiv preprint arXiv:2412.01718, 2024

  27. [28]

    End-to-end autonomous driving: Challenges and frontiers,

    L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  28. [29]

    Recent advancements in end-to-end autonomous driving using deep learning: A survey,

    P. S. Chib and P. Singh, “Recent advancements in end-to-end autonomous driving using deep learning: A survey,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 103–118, 2023

  29. [30]

    Diffstack: A differentiable and modular control stack for autonomous vehicles,

    P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differentiable and modular control stack for autonomous vehicles,” in Conference on robot learning. PMLR, 2023, pp. 2170–2180

  30. [31]

    Planning-oriented autonomous driving,

    Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y. Qiao, and H. Li, “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  31. [32]

    Safety-enhanced autonomous driving using interpretable sensor fusion transformer,

    H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced autonomous driving using interpretable sensor fusion transformer,” in Conference on Robot Learning. PMLR, 2023, pp. 726–737

  32. [33]

    Visual point cloud forecasting enables scalable autonomous driving,

    Z. Yang, L. Chen, Y. Sun, and H. Li, “Visual point cloud forecasting enables scalable autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14673–14684

  33. [34]

    Behavior-inspired neural networks for relational inference,

    Y. Yang, B. Feng, K. Wang, N. Leonard, A. B. Dieng, and C. Allen-Blanchette, “Behavior-inspired neural networks for relational inference,” arXiv preprint arXiv:2406.14746, 2024

  34. [35]

    Latent variable sequential set transformers for joint multi-agent motion prediction,

    R. Girgis, F. Golemo, F. Codevilla, M. Weiss, J. A. D’Souza, S. E. Kahou, F. Heide, and C. Pal, “Latent variable sequential set transformers for joint multi-agent motion prediction,” arXiv preprint arXiv:2104.00563, 2021

  35. [36]

    Plant: Explainable planning transformers via object-level representations,

    K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger, “Plant: Explainable planning transformers via object-level representations,” arXiv preprint arXiv:2210.14222, 2022

  36. [37]

    Quad: Query-based interpretable neural motion planning for autonomous driving,

    S. Biswas, S. Casas, Q. Sykora, B. Agro, A. Sadat, and R. Urtasun, “Quad: Query-based interpretable neural motion planning for autonomous driving,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14236–14243

  37. [38]

    St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,

    S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in European Conference on Computer Vision. Springer, 2022, pp. 533–549

  38. [39]

    Dualad: Disentangling the dynamic and static world for end-to-end driving,

    S. Doll, N. Hanselmann, L. Schneider, R. Schulz, M. Cordts, M. Enzweiler, and H. P. Lensch, “Dualad: Disentangling the dynamic and static world for end-to-end driving,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14728–14737

  39. [40]

    Para-drive: Parallelized architecture for real-time autonomous driving,

    X. Weng, B. Ivanovic, Y. Wang, Y. Wang, and M. Pavone, “Para-drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15449–15458

  40. [41]

    Vlp: Vision language planning for autonomous driving,

    C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14760–14769

  41. [42]

    Vlm-ad: End-to-end autonomous driving through vision-language model supervision,

    Y. Xu, Y. Hu, Z. Zhang, G. P. Meyer, S. K. Mustikovela, S. Srinivasa, E. M. Wolff, and X. Huang, “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024

  42. [43]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    Y. Wang, W. Luo, J. Bai, Y. Cao, T. Che, K. Chen, Y. Chen, J. Diamond, Y. Ding, W. Ding et al., “Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail,” arXiv preprint arXiv:2511.00088, 2025

  43. [44]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015

  44. [45]

    Knowledge distillation: A survey,

    J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021

  45. [46]

    A Survey on Knowledge Distillation of Large Language Models

    X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, and T. Zhou, “A survey on knowledge distillation of large language models,” arXiv preprint arXiv:2402.13116, 2024

  46. [47]

    Learning Transferable Visual Models From Natural Language Supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning. PMLR, Jul. 2021, pp. 8748–8763, ISSN: 2640-3498. [Online]. Available: ht...

  47. [48]

    Reproducible Scaling Laws for Contrastive Language-Image Learning,

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible Scaling Laws for Contrastive Language-Image Learning,” 2023, pp. 2818–2829. [Online]. Available: https://openaccess.thecvf.com/content/ CVPR2023/html/Cherti Reproducible Scaling Laws for Contrastive Language-Image Learning CVPR 2...

  48. [49]

    Supervised contrastive learning,

    P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, vol. 33, pp. 18661–18673, 2020

  49. [50]

    Sensei: Semantic exploration guided by foundation models to learn versatile world models,

    C. Sancaktar, C. Gumbsch, A. Zadaianchuk, P. Kolev, and G. Martius, “Sensei: Semantic exploration guided by foundation models to learn versatile world models,” arXiv preprint arXiv:2503.01584, 2025

  50. [51]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019

  51. [52]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024

  52. [53]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

    Appendix A: Hallucinations for Inference-Time VLMs (Sec. 1.0 Main). Methods that use finetuned multimodal VLMs at inference ...

  53. [54]

    **Traffic Lights **: There are no visible traffic lights in the image. 2. **Movements of Other Cars or Pedestrians **: There are no other cars or pedestrians visible in the image. 3. **Lane Markings **: The road has clear lane markings, including a solid white line on the right side and a dashed white line on the left side. There is also a black and white...

  54. [55]

    Bus Scene

    **Bus (Location: Center of the image, moving towards the camera) **: - **Description**: The bus is moving towards you on the same road. It is important to monitor its speed and direction to ensure safe overtaking or passing. - **Why it’s important**: Ensuring you have enough space to overtake safely is crucial to avoid collisions. ... Fig. 6: OpenEMMA Tes...

  55. [56]

    It is important to pay attention to this truck because it is a large vehicle that may have a longer stopping distance than smaller cars

    **FedEx Truck (Location: Right side of the image, near the center) ** - **Description:** The FedEx truck is on the right side of the image, near the center. It is important to pay attention to this truck because it is a large vehicle that may have a longer stopping distance than smaller cars. You should be prepared for it to slow down or stop suddenly, es...