VERDI: VLM-Embedded Reasoning for Autonomous Driving

Anirudha Majumdar; Baiang Li; Bowen Feng; Felix Heide; Filippo Ghilotti; Julian Ost; Roger Girgis; Zhiting Mei

arXiv: 2505.15925 · v4 · submitted 2025-05-21 · cs.RO · cs.AI · cs.CV

Bowen Feng, Zhiting Mei, Julian Ost, Filippo Ghilotti, Baiang Li, Roger Girgis, Anirudha Majumdar, Felix Heide

Pith reviewed 2026-05-22 13:24 UTC · model grok-4.3

keywords: autonomous driving, vision-language models, knowledge distillation, end-to-end planning, latent space alignment, commonsense reasoning

The pith

VERDI aligns intermediate outputs from perception, prediction and planning modules with VLM text features at training time so the driving stack absorbs commonsense reasoning without paying VLM inference costs at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autonomous driving stacks often fail at decisions under partial observability, while large VLMs can supply commonsense reasoning but are too slow and memory-heavy for deployment. VERDI addresses this by distilling VLM reasoning into existing modular end-to-end AD models through a latent-space alignment loss applied at the three main stages: perception, prediction, and planning. The alignment encourages each module to produce outputs whose features match explanations the VLM would generate for the same scene, letting the compact AD network internalize the reasoning structure. Experiments report up to 11 percent lower L2 trajectory error in open-loop tests and a 10 percent gain in non-collision rate inside the closed-loop HugSim simulator, all while retaining the fast inference speed of the original modular stack.

Core claim

VERDI augments modular differentiable end-to-end AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs, enabling the modular AD stack to internalize structured reasoning without incurring the inference-time costs of large VLMs.

What carries the argument

Latent-space alignment loss that matches AD module outputs at perception, prediction and planning stages to VLM-generated text features describing driving reasoning.
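
To make the mechanism concrete, the following is a minimal sketch of a per-stage alignment term, assuming each module output has already been projected into the same space as the VLM text feature. The cosine similarity matches the form quoted from the paper further down this page; turning it into a loss as `1 - cos` and averaging over stages are assumptions made here for illustration, and all names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(f_p, f_m):
    """Cosine similarity between a projected module feature f_p and a text feature f_m.
    Both vectors must be non-zero; a zero vector would divide by zero."""
    dot = sum(a * b for a, b in zip(f_p, f_m))
    norm_p = math.sqrt(sum(a * a for a in f_p))
    norm_m = math.sqrt(sum(b * b for b in f_m))
    return dot / (norm_p * norm_m)

def alignment_loss(stage_features, text_features):
    """Mean of (1 - cosine similarity) over the three stages.
    Assumption: the 1 - cos form and equal stage weights are illustrative only."""
    stages = ["perception", "prediction", "planning"]
    terms = [1.0 - cosine_similarity(stage_features[s], text_features[s]) for s in stages]
    return sum(terms) / len(terms)

# Hypothetical 2-D toy features, not taken from the paper.
module_out = {"perception": [1.0, 0.0], "prediction": [1.0, 1.0], "planning": [0.0, 2.0]}
vlm_text = {"perception": [2.0, 0.0], "prediction": [1.0, 0.0], "planning": [0.0, -1.0]}

loss = alignment_loss(module_out, vlm_text)
print(f"alignment loss: {loss:.4f}")
```

In this toy case the perception features point the same way (term 0), the planning features point in opposite directions (term 2), and the prediction features sit in between, so the average reflects how far each stage is from its text description.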

If this is right

The aligned models achieve up to 11 percent lower L2 distance than prior end-to-end methods without embedded reasoning.
Closed-loop driving in the HugSim simulator reaches the highest overall score with a 10 percent gain in non-collision rate.
Inference speed remains fast because no VLM runs at test time.
Modular structure is preserved, supporting safety decomposition that monolithic VLM planners lack.
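
For readers unfamiliar with the open-loop metric, here is a small sketch of how an average L2 trajectory error and a relative reduction could be computed. The waypoints are hypothetical toy values, and the averaging convention is an assumption; the exact protocol the paper uses is not stated on this page.

```python
import math

def mean_l2_error(planned, ground_truth):
    """Average Euclidean distance between matching planned and ground-truth waypoints."""
    dists = [math.dist(p, g) for p, g in zip(planned, ground_truth)]
    return sum(dists) / len(dists)

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline` (0.11 means 11 percent lower)."""
    return (baseline - improved) / baseline

# Hypothetical (x, y) waypoints in metres; not data from the paper.
ground_truth = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
baseline_plan = [(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)]
aligned_plan = [(0.4, 1.0), (0.4, 2.0), (0.4, 3.0)]

baseline_err = mean_l2_error(baseline_plan, ground_truth)
aligned_err = mean_l2_error(aligned_plan, ground_truth)
reduction = relative_reduction(baseline_err, aligned_err)
print(f"baseline {baseline_err:.2f} m, aligned {aligned_err:.2f} m, reduction {reduction:.0%}")
```

A percentage like the one reported is a relative figure, so its practical meaning depends on the absolute baseline error it is measured against.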

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same latent-alignment idea could transfer commonsense from large models into other sequential control tasks that must run at real-time rates.
Measuring how much each stage (perception versus planning) benefits from the alignment would clarify where the reasoning transfer occurs most strongly.
Replacing the VLM text features with cheaper synthetic descriptions might test how much of the gain depends on the specific VLM chosen.

Load-bearing premise

That matching the latent representations of driving modules to VLM text features is enough to transfer useful commonsense reasoning into the driving policy.

What would settle it

A closed-loop test in which the VERDI-trained model shows no improvement or a drop in non-collision rate and trajectory accuracy relative to the identical baseline without any VLM alignment would falsify the claimed benefit.

Figures

Figures reproduced from arXiv: 2505.15925 by Anirudha Majumdar, Baiang Li, Bowen Feng, Felix Heide, Filippo Ghilotti, Julian Ost, Roger Girgis, Zhiting Mei.

**Figure 1.** Overview of VERDI. Our pipeline aligns the VLM reasoning module with our e2e driving model. During training, the ground truth (GT) trajectory and observed images are provided to the VLM for it to explain the reasoning throughout perception, prediction, and planning during the driving process. The VLM’s answers for each submodule are aligned with the corresponding submodule outputs from the e2e driving model,…

**Figure 2.** Obtaining description features through chain-of-thought prompting and text encoder. For each query, the prompt consists…

**Figure 3.** VERDI Training. The e2e model is trained with VERDI for the individual perception, prediction, and planning modules. All relevant feature maps F and Q are first mapped to a feature fP in a representation space, which is shared with the encoded language features fM. This mapping is facilitated by VERDI’s trainable PFP layers. The perception outputs Fperception including the extracted image features, are dir…

**Figure 4.** Qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) on the nuScenes dataset [13]. Each entry shows the multi-view camera observations on the left and the BEV view on the right at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image and BEV panel as a solid green line that fades to blue. The BEV p…

**Figure 5.** Example scenes with RC drops. We qualitatively show…

**Figure 6.** OpenEMMA Testing Example (Bus Scene) on the…

**Figure 7.** OpenEMMA Testing Example (City Scene) on the…

**Figure 8.** OpenEMMA Testing Example (Stop Sign Scene) on…

**Figure 10.** Additional qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) on the nuScenes dataset [13]. Each entry shows the multi-view camera observations on the left and the BEV view on the right at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image and BEV panel as a solid green line that fades to blu…

**Figure 11.** Qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) in the HugSim simulator [27] closed-loop test. Each entry shows the front view camera observations at one time step t. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image as sequential yellow dots as waypoints. Each example shows successful performance on end t…

Original abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We evaluate VERDI in both open-loop and closed-loop settings. Our method outperforms existing end-to-end approaches without embedded reasoning by up to 11% in $\ell_{2}$ distance, and achieves the best overall driving performance in the closed-loop HugSim simulator, including a 10% improvement in Non-Collision Rate, while maintaining fast inference speed.
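
The memory figure in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes 16-bit parameters; the abstract does not state the precision, so that choice and the helper name are assumptions.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 70e9  # 70B parameters, as quoted in the abstract

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # assumption: 2 bytes per parameter
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # for comparison: 4 bytes per parameter

# Whatever lies between the weights and the quoted 160G would have to cover
# activations, the attention cache, and runtime overhead (assumed, not stated).
headroom_gb = 160.0 - fp16_gb
print(f"fp16 weights: {fp16_gb:.0f} GB, fp32 weights: {fp32_gb:.0f} GB, headroom to 160 GB: {headroom_gb:.0f} GB")
```

Under the 16-bit assumption the weights alone take 140 GB, which is consistent with the abstract's "more than 160G" once runtime state is included.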

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VERDI aligns perception-prediction-planning latents with VLM text embeddings at training time to add reasoning without inference cost, but the reported gains could come from generic extra supervision rather than transferred commonsense.

VERDI aligns the intermediate outputs from perception, prediction, and planning modules in a modular differentiable AD stack with text features produced by a VLM that describe the driving reasoning. This alignment runs only during training, so the deployed model stays fast and keeps its modular structure for safety checks. The approach directly targets the memory and speed problems of running large VLMs at inference while trying to keep some of their high-level knowledge. The authors report up to 11% lower L2 distance and the best closed-loop results in HugSim, including a 10% gain in non-collision rate over existing end-to-end baselines.

The multi-stage alignment is the concrete new piece relative to prior inference-time VLM planning work. It gives a practical way to inject external knowledge into an existing pipeline without rewriting the whole system.

The central weakness is that latent alignment shows statistical similarity but supplies no mechanism that forces the planning module to apply the same step-by-step logic for partial observability or edge cases. The performance deltas could result from the extra loss term acting as ordinary regularization. The abstract gives no statistical significance numbers, exact baseline details, or ablations that isolate the VLM contribution, so the evidence for actual reasoning transfer remains moderate.

This paper is useful for groups already running modular AD stacks who want to test whether language-model knowledge can be baked in cheaply. It has enough technical coherence and empirical signal to deserve peer review, though any referee would press for stronger mechanistic checks and full experimental transparency before the claims can be taken as settled.
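
One control that could help separate content-specific transfer from generic regularization is to pair each scene's module feature with another scene's text feature and compare alignment scores. The sketch below shows only the pairing and scoring step on hypothetical toy vectors; an actual ablation would require retraining with the mismatched text, which is not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_alignment(module_feats, text_feats):
    """Average cosine similarity over scene-wise pairs."""
    scores = [cosine(m, t) for m, t in zip(module_feats, text_feats)]
    return sum(scores) / len(scores)

def rotate_by_one(items):
    """Shift the list by one position so no scene keeps its own text (needs len >= 2)."""
    return items[1:] + items[:1]

# Hypothetical per-scene features; not taken from the paper.
module_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

matched = mean_alignment(module_feats, text_feats)
mismatched = mean_alignment(module_feats, rotate_by_one(text_feats))
print(f"matched: {matched:.3f}, mismatched: {mismatched:.3f}")
```

If a model trained against mismatched text performed as well as one trained against the correct text, that would favor the regularization explanation over reasoning transfer.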

Referee Report

2 major / 2 minor

Summary. The paper proposes VERDI, a training-time framework that augments modular differentiable end-to-end autonomous driving models by aligning intermediate outputs at the perception, prediction, and planning stages with text features from VLMs that describe driving reasoning. This alignment is intended to distill commonsense knowledge and structured reasoning into the AD stack without incurring VLM inference costs at deployment. The method is evaluated in open- and closed-loop settings, reporting up to 11% improvement in L2 distance over existing end-to-end approaches and a 10% gain in non-collision rate in the HugSim simulator while preserving fast inference.

Significance. If the alignment mechanism demonstrably transfers functional reasoning rather than providing generic regularization, the approach would offer a practical route to embedding human-like decision-making in efficient, safety-decomposable AD stacks. The reported closed-loop gains on non-collision rate and trajectory accuracy would then represent a meaningful advance for handling partial observability without monolithic VLM deployment.

major comments (2)

[Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.
[Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.

minor comments (2)

[Method] Clarify whether the VLM text features are high-level summaries or step-wise decision traces, as this affects the interpretation of what reasoning is being transferred.
[Discussion] Add a brief discussion of potential failure modes when VLM features contain hallucinations or biases that could propagate through the alignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our contributions. We address each major comment below, providing the strongest honest defense of the manuscript while indicating planned revisions where appropriate.

Referee: [Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.

Authors: We agree that stronger evidence is needed to distinguish reasoning transfer from generic auxiliary supervision. The alignment objective specifically projects module outputs onto VLM text embeddings that encode explicit driving reasoning (e.g., descriptions of occluded agents or intent inference), rather than arbitrary features. The closed-loop gains, especially the 10% non-collision improvement in scenarios requiring partial-observability reasoning, provide indirect support that the internalized representations are functionally relevant. To directly address the concern, we will add an ablation replacing reasoning-specific VLM text with random or non-driving captions and demonstrate degraded performance, isolating the contribution of the structured reasoning content. revision: yes
Referee: [Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.

Authors: We concur that statistical rigor and precise experimental details are necessary to substantiate the reported improvements. In the revised manuscript we will include statistical significance tests (e.g., paired t-tests across multiple seeds with p-values) for both L2 distance and non-collision rate. We will also document the exact hyperparameter settings, training protocols, and baseline configurations used, together with additional controls such as fixed random seeds and sensitivity analysis to post-hoc choices. revision: yes
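
As an illustration of the kind of test promised in the second response, the sketch below computes a paired t statistic across seeds using only the standard library. The per-seed L2 values are invented placeholders, not results from the paper, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(baseline, treated):
    """Paired t statistic and degrees of freedom for per-seed differences
    (baseline - treated). Requires at least two seeds with non-identical differences."""
    diffs = [b - t for b, t in zip(baseline, treated)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Placeholder per-seed L2 errors (metres), one entry per random seed.
# These numbers are made up for the example.
baseline_l2 = [0.72, 0.70, 0.74, 0.71, 0.73]
aligned_l2 = [0.65, 0.66, 0.66, 0.63, 0.68]

t_stat, dof = paired_t_statistic(baseline_l2, aligned_l2)

# Two-sided critical value at the 0.05 level for 4 degrees of freedom.
T_CRIT_DOF4 = 2.776
significant = abs(t_stat) > T_CRIT_DOF4
print(f"t = {t_stat:.2f}, dof = {dof}, significant at 0.05: {significant}")
```

A paired design fits here because both models would be run on the same seeds and scenes; an exact p-value needs the t-distribution CDF, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.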

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation consists of a training-time latent alignment objective between modular AD outputs and externally generated VLM text features, followed by separate empirical evaluation of the resulting model in open-loop and closed-loop settings. The alignment serves as an auxiliary supervision signal rather than a self-referential definition of the reported metrics; L2 distance and non-collision rate are measured against independent baselines and simulators. No self-citations, uniqueness theorems, or fitted parameters that are later renamed as predictions appear in the abstract or method description. The central claim therefore remains an empirical hypothesis about transfer via alignment, not a reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that VLM text features contain transferable driving reasoning and that latent alignment can embed this reasoning into the AD modules. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption VLM-generated text features encode useful commonsense reasoning for driving decisions under partial observability.
Invoked when the method aligns AD module outputs with these features to internalize reasoning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: VERDI augments modular differentiable end-to-end AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs... $L_f(f_P, f_M) = \frac{f_P \cdot f_M}{\lVert f_P \rVert \, \lVert f_M \rVert}$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: We evaluate VERDI in both open-loop and closed-loop settings... 11% in $\ell_{2}$ distance... 10% improvement in Non-Collision Rate

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
cs.CV 2026-04 unverdicted novelty 6.0

Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
cs.CV 2026-04 conditional novelty 6.0

VENUSS benchmark shows top VLMs achieve 57% accuracy on sequential driving scenes, strong on static objects but weak on vehicle dynamics and temporal relations.

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
cs.CV 2026-04 unverdicted novelty 6.0

VENUSS evaluates 25+ VLMs across 2600+ sequential driving scenarios and finds top models reach only 57% accuracy versus 65% for humans, with good static detection but poor performance on vehicle dynamics and temporal ...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1]

Comparative safety performance of autonomous-and human drivers: A real-world case study of the waymo driver,

L. Di Lillo, T. Gode, X. Zhou, M. Atzei, R. Chen, and T. Victor, “Comparative safety performance of autonomous-and human drivers: A real-world case study of the waymo driver,”Heliyon, vol. 10, no. 14, 2024

work page 2024
[2]

Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,

M. Jung, J. Park, and M.-S. Pang, “Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,” 2024

work page 2024
[3]

Learning from all vehicles,

D. Chen and P. Kr ¨ahenb¨uhl, “Learning from all vehicles,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 222–17 231

work page 2022
[4]

Effective adaptation in multi-task co-training for unified autonomous driving,

X. Liang, Y . Wu, J. Han, H. Xu, C. Xu, and X. Liang, “Effective adaptation in multi-task co-training for unified autonomous driving,” Advances in Neural Information Processing Systems, vol. 35, pp. 19 645– 19 658, 2022

work page 2022
[5]

End-to-end interpretable neural motion planner,

W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8660–8669

work page 2019
[6]

Beverse: Uniﬁed perception and prediction in birds-eye-view for vision-centric autonomous driving,

Y . Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision- centric autonomous driving,”arXiv preprint arXiv:2205.09743, 2022

work page arXiv 2022
[7]

Learning unsupervised world models for autonomous driving via discrete diffusion,

L. Zhang, Y . Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun, “Learning unsupervised world models for autonomous driving via discrete diffusion,” arXiv preprint arXiv:2311.01017, 2023

work page arXiv 2023
[8]

Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,

X. Zhang, X. Tan, Y . An, Y . Li, and Z. Fan, “Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,”Expert Systems with Applications, vol. 252, p. 124158, 2024

work page 2024
[9]

Planning-oriented autonomous driving,

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

work page 2023
[10]

Vad: Vectorized scene representation for efficient autonomous driving,

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8340–8350

work page 2023
[11]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

S. Chen, B. Jiang, H. Gao, B. Liao, Q. Xu, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Vadv2: End-to-end vectorized autonomous driving via probabilistic planning,”arXiv preprint arXiv:2402.13243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Is ego status all you need for open-loop end-to-end autonomous driving?

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez, “Is ego status all you need for open-loop end-to-end autonomous driving?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 864–14 873

work page 2024
[13]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

work page 2020
[14]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caineet al., “Scalability in perception for autonomous driving: Waymo open dataset,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

work page 2020
[16]

Bounded rationality,

H. A. Simon, “Bounded rationality,”Utility and probability, pp. 15–18, 1990

work page 1990
[17]

Bounded rationality,

B. D. Jones, “Bounded rationality,”Annual review of political science, vol. 2, no. 1, pp. 297–321, 1999

work page 1999
[18]

EMMA: End-to-End Multimodal Model for Autonomous Driving

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-End Multimodal Model for Autonomous Driving,” Oct. 2024, arXiv:2410.23262 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2410.23262

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,

S. Xing, C. Qian, Y . Wang, H. Hua, K. Tian, Y . Zhou, and Z. Tu, “OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,” Dec. 2024, arXiv:2412.15208 [cs]. [Online]. Available: http://arxiv.org/abs/2412.15208

work page arXiv 2024
[20]

Driving with llms: Fusing object- level vector modality for explainable autonomous driving,

L. Chen, O. Sinavski, J. H ¨unermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with llms: Fusing object- level vector modality for explainable autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14 093–14 100

work page 2024
[21]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model,

Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K.-Y . K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,”IEEE Robotics and Automation Letters, 2024

work page 2024
[22]

Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,

S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y . Li, and J. M. Alvarez, “Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,”arXiv preprint arXiv:2405.01533, 2024

work page arXiv 2024
[23]

Drivelm: Driving with graph visual question answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 256–274

work page 2024
[24]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving

W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y . Wen, S. Wu, H. Deng, Z. Liet al., “Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving,”arXiv preprint arXiv:2312.09245, 2023

work page arXiv 2023
[25]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,”arXiv preprint arXiv:2402.12289, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

work page 2022
[27]

Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao, “Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,”arXiv preprint arXiv:2412.01718, 2024

work page arXiv 2024
[28]

End-to-end autonomous driving: Challenges and frontiers,

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[29]

Recent advancements in end-to-end autonomous driving using deep learning: A survey,

P. S. Chib and P. Singh, “Recent advancements in end-to-end autonomous driving using deep learning: A survey,”IEEE Transactions on Intelligent V ehicles, vol. 9, no. 1, pp. 103–118, 2023

work page 2023
[30]

Diffstack: A differentiable and modular control stack for autonomous vehicles,

P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differentiable and modular control stack for autonomous vehicles,” in Conference on robot learning. PMLR, 2023, pp. 2170–2180

work page 2023
[31]

Planning-oriented autonomous driving,

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li, “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[32]

Safety-enhanced autonomous driving using interpretable sensor fusion transformer,

H. Shao, L. Wang, R. Chen, H. Li, and Y . Liu, “Safety-enhanced autonomous driving using interpretable sensor fusion transformer,” in Conference on Robot Learning. PMLR, 2023, pp. 726–737

work page 2023
[33]

Visual point cloud forecasting enables scalable autonomous driving,

Z. Yang, L. Chen, Y . Sun, and H. Li, “Visual point cloud forecasting enables scalable autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 673–14 684

work page 2024
[34]

Behavior-inspired neural networks for relational inference,

Y . Yang, B. Feng, K. Wang, N. Leonard, A. B. Dieng, and C. Allen- Blanchette, “Behavior-inspired neural networks for relational inference,” arXiv preprint arXiv:2406.14746, 2024

work page arXiv 2024
[35]

Latent variable sequential set transformers for joint multi-agent motion prediction,

R. Girgis, F. Golemo, F. Codevilla, M. Weiss, J. A. D’Souza, S. E. Kahou, F. Heide, and C. Pal, “Latent variable sequential set transformers for joint multi-agent motion prediction,”arXiv preprint arXiv:2104.00563, 2021

work page arXiv 2021
[36]

Plant: Explainable planning transformers via object-level representations,

K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger, “Plant: Explainable planning transformers via object-level representations,” arXiv preprint arXiv:2210.14222, 2022

work page arXiv 2022
[37]

Quad: Query-based interpretable neural motion planning for autonomous driving,

S. Biswas, S. Casas, Q. Sykora, B. Agro, A. Sadat, and R. Urtasun, “Quad: Query-based interpretable neural motion planning for autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14 236–14 243

work page 2024
[38]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,

S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in European Conference on Computer Vision. Springer, 2022, pp. 533–549

work page 2022
[39]

Dualad: Disentangling the dynamic and static world for end-to-end driving,

S. Doll, N. Hanselmann, L. Schneider, R. Schulz, M. Cordts, M. En- zweiler, and H. P. Lensch, “Dualad: Disentangling the dynamic and static world for end-to-end driving,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 728– 14 737

work page 2024
[40]

Para-drive: Parallelized architecture for real-time autonomous driving,

X. Weng, B. Ivanovic, Y. Wang, Y. Wang, and M. Pavone, “Para-drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 449–15 458

[41]

Vlp: Vision language planning for autonomous driving,

C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 760–14 769

[42]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision,

Y. Xu, Y. Hu, Z. Zhang, G. P. Meyer, S. K. Mustikovela, S. Srinivasa, E. M. Wolff, and X. Huang, “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024

[43]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Y. Wang, W. Luo, J. Bai, Y. Cao, T. Che, K. Chen, Y. Chen, J. Diamond, Y. Ding, W. Ding et al., “Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail,” arXiv preprint arXiv:2511.00088, 2025

[44]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015

[45]

Knowledge distillation: A survey,

J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021

[46]

A Survey on Knowledge Distillation of Large Language Models

X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, and T. Zhou, “A survey on knowledge distillation of large language models,” arXiv preprint arXiv:2402.13116, 2024

[47]

Learning Transferable Visual Models From Natural Language Supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning. PMLR, Jul. 2021, pp. 8748–8763, ISSN: 2640-3498. [Online]. Available: ht...

[48]

Reproducible Scaling Laws for Contrastive Language-Image Learning,

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible Scaling Laws for Contrastive Language-Image Learning,” 2023, pp. 2818–2829. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Cherti_Reproducible_Scaling_Laws_for_Contrastive_Language-Image_Learning_CVPR_2...

[49]

Supervised contrastive learning,

P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, vol. 33, pp. 18 661–18 673, 2020

[50]

Sensei: Semantic exploration guided by foundation models to learn versatile world models,

C. Sancaktar, C. Gumbsch, A. Zadaianchuk, P. Kolev, and G. Martius, “Sensei: Semantic exploration guided by foundation models to learn versatile world models,” arXiv preprint arXiv:2503.01584, 2025

[51]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019

[52]

Qwen2.5 Technical Report

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024

[53]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

APPENDIX A HALLUCINATIONS FOR INFERENCE-TIME VLMS (SEC. 1.0 MAIN)

Methods that use finetuned multimodal VLMs at inference ...

[54]

1. **Traffic Lights**: There are no visible traffic lights in the image.
2. **Movements of Other Cars or Pedestrians**: There are no other cars or pedestrians visible in the image.
3. **Lane Markings**: The road has clear lane markings, including a solid white line on the right side and a dashed white line on the left side. There is also a black and white...

[55]

Bus Scene

**Bus (Location: Center of the image, moving towards the camera)**:
- **Description**: The bus is moving towards you on the same road. It is important to monitor its speed and direction to ensure safe overtaking or passing.
- **Why it’s important**: Ensuring you have enough space to overtake safely is crucial to avoid collisions.
...

Fig. 6: OpenEMMA Tes...

[56]


**FedEx Truck (Location: Right side of the image, near the center)**
- **Description:** The FedEx truck is on the right side of the image, near the center. It is important to pay attention to this truck because it is a large vehicle that may have a longer stopping distance than smaller cars. You should be prepared for it to slow down or stop suddenly, es...


[1]

Comparative safety performance of autonomous- and human drivers: A real-world case study of the waymo driver,

L. Di Lillo, T. Gode, X. Zhou, M. Atzei, R. Chen, and T. Victor, “Comparative safety performance of autonomous- and human drivers: A real-world case study of the waymo driver,” Heliyon, vol. 10, no. 14, 2024

[2]

Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,

M. Jung, J. Park, and M.-S. Pang, “Safety on autopilot: An empirical investigation of autonomous driving and traffic safety,” 2024

[3]

Learning from all vehicles,

D. Chen and P. Krähenbühl, “Learning from all vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 222–17 231

[4]

Effective adaptation in multi-task co-training for unified autonomous driving,

X. Liang, Y. Wu, J. Han, H. Xu, C. Xu, and X. Liang, “Effective adaptation in multi-task co-training for unified autonomous driving,” Advances in Neural Information Processing Systems, vol. 35, pp. 19 645–19 658, 2022

[5]

End-to-end interpretable neural motion planner,

W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8660–8669

[6]

Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,

Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022

[7]

Learning unsupervised world models for autonomous driving via discrete diffusion,

L. Zhang, Y. Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun, “Learning unsupervised world models for autonomous driving via discrete diffusion,” arXiv preprint arXiv:2311.01017, 2023

[8]

Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,

X. Zhang, X. Tan, Y. An, Y. Li, and Z. Fan, “Oatracker: Object-aware anti-occlusion 3d multiobject tracking for autonomous driving,” Expert Systems with Applications, vol. 252, p. 124158, 2024

[9]

Planning-oriented autonomous driving,

Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

[10]

Vad: Vectorized scene representation for efficient autonomous driving,

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8340–8350

[11]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

S. Chen, B. Jiang, H. Gao, B. Liao, Q. Xu, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Vadv2: End-to-end vectorized autonomous driving via probabilistic planning,” arXiv preprint arXiv:2402.13243, 2024

[12]

Is ego status all you need for open-loop end-to-end autonomous driving?

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez, “Is ego status all you need for open-loop end-to-end autonomous driving?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 864–14 873

[13]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

[14]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

[16]

Bounded rationality,

H. A. Simon, “Bounded rationality,” Utility and probability, pp. 15–18, 1990

[17]

Bounded rationality,

B. D. Jones, “Bounded rationality,” Annual review of political science, vol. 2, no. 1, pp. 297–321, 1999

[18]

EMMA: End-to-End Multimodal Model for Autonomous Driving

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-End Multimodal Model for Autonomous Driving,” Oct. 2024, arXiv:2410.23262 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2410.23262

[19]

OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,

S. Xing, C. Qian, Y. Wang, H. Hua, K. Tian, Y. Zhou, and Z. Tu, “OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving,” Dec. 2024, arXiv:2412.15208 [cs]. [Online]. Available: http://arxiv.org/abs/2412.15208

[20]

Driving with llms: Fusing object-level vector modality for explainable autonomous driving,

L. Chen, O. Sinavski, J. Hünermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with llms: Fusing object-level vector modality for explainable autonomous driving,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14 093–14 100

[21]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model,

Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K.-Y. K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,” IEEE Robotics and Automation Letters, 2024

[22]

Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,

S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, and J. M. Alvarez, “Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,” arXiv preprint arXiv:2405.01533, 2024

[23]

Drivelm: Driving with graph visual question answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274

[24]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving

W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y. Wen, S. Wu, H. Deng, Z. Li et al., “Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving,” arXiv preprint arXiv:2312.09245, 2023

[25]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024

[26]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

[27]

Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,

H. Zhou, L. Lin, J. Wang, Y. Lu, D. Bai, B. Liu, Y. Wang, A. Geiger, and Y. Liao, “Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving,” arXiv preprint arXiv:2412.01718, 2024

[28]

End-to-end autonomous driving: Challenges and frontiers,

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

[29]

Recent advancements in end-to-end autonomous driving using deep learning: A survey,

P. S. Chib and P. Singh, “Recent advancements in end-to-end autonomous driving using deep learning: A survey,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 103–118, 2023

[30]

Diffstack: A differentiable and modular control stack for autonomous vehicles,

P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differentiable and modular control stack for autonomous vehicles,” in Conference on robot learning. PMLR, 2023, pp. 2170–2180

[31]

Planning-oriented autonomous driving,

Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y. Qiao, and H. Li, “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

[32]

Safety-enhanced autonomous driving using interpretable sensor fusion transformer,

H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced autonomous driving using interpretable sensor fusion transformer,” in Conference on Robot Learning. PMLR, 2023, pp. 726–737

[33]

Visual point cloud forecasting enables scalable autonomous driving,

Z. Yang, L. Chen, Y. Sun, and H. Li, “Visual point cloud forecasting enables scalable autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 673–14 684
