pith. machine review for the scientific record.

arxiv: 2505.15925 · v4 · submitted 2025-05-21 · 💻 cs.RO · cs.AI · cs.CV

Recognition: unknown

VERDI: VLM-Embedded Reasoning for Autonomous Driving

Authors on Pith no claims yet
classification 💻 cs.RO cs.AI cs.CV
keywords reasoning · driving · verdi · vlms · autonomous · inference · closed-loop · commonsense
read the original abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (running a 70B-parameter VLM at merely 8 tokens per second requires more than 160 GB of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning without incurring the inference-time costs of large VLMs. We evaluate VERDI in both open-loop and closed-loop settings. Our method outperforms existing end-to-end approaches without embedded reasoning by up to 11% in $\ell_{2}$ distance, and achieves the best overall driving performance in the closed-loop HugSim simulator, including a 10% improvement in Non-Collision Rate, while maintaining fast inference speed.
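To make the alignment idea concrete, below is a minimal sketch (not the authors' released code) of a VERDI-style training-time objective under the abstract's description: each module's intermediate features are projected into the VLM text-embedding space and pulled toward precomputed VLM reasoning-text features via a cosine-similarity term added to the usual end-to-end driving loss. All class names, dimensions, and the loss weight `alpha` are illustrative assumptions.

```python
# Sketch of training-time latent alignment between AD module outputs and
# offline VLM reasoning-text features; names and dimensions are assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class StageAligner(nn.Module):
    """Projects one module's latent output to the VLM text-embedding dimension."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, module_feat: torch.Tensor, vlm_text_feat: torch.Tensor) -> torch.Tensor:
        # Alignment loss: 1 - cosine similarity between the projected module
        # feature and the (frozen) VLM text feature.
        z = F.normalize(self.proj(module_feat), dim=-1)
        t = F.normalize(vlm_text_feat, dim=-1)
        return (1.0 - (z * t).sum(dim=-1)).mean()


def verdi_style_loss(stage_feats, vlm_feats, aligners, driving_loss, alpha=0.1):
    """Driving loss plus alignment terms for perception / prediction / planning.

    stage_feats, vlm_feats: dicts keyed by stage name; aligners: nn.ModuleDict.
    VLM text features are produced offline, so no VLM runs at inference time.
    """
    align = sum(aligners[k](stage_feats[k], vlm_feats[k].detach()) for k in stage_feats)
    return driving_loss + alpha * align


if __name__ == "__main__":
    stages = ["perception", "prediction", "planning"]
    feat_dim, text_dim, batch = 256, 768, 4
    aligners = nn.ModuleDict({s: StageAligner(feat_dim, text_dim) for s in stages})
    stage_feats = {s: torch.randn(batch, feat_dim) for s in stages}   # module latents
    vlm_feats = {s: torch.randn(batch, text_dim) for s in stages}     # offline VLM text features
    driving_loss = torch.tensor(1.0)                                  # placeholder e2e planning loss
    loss = verdi_style_loss(stage_feats, vlm_feats, aligners, driving_loss)
    print(loss.item())
```

Because the VLM text features are only needed during training, deployment keeps just the distilled modular stack, which is what underlies the abstract's inference-speed claim.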

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  2. How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

    cs.CV 2026-04 conditional novelty 6.0

    VENUSS benchmark shows top VLMs achieve 57% accuracy on sequential driving scenes, strong on static objects but weak on vehicle dynamics and temporal relations.