CarLLaVA: Vision language models for camera-only closed-loop driving

Alice Karnsund; Ana-Maria Marcu; Benoit Hanotte; Elahe Arani; Jamie Shotton; Jan Hünermann; Katrin Renz; Long Chen; Oleg Sinavski

arxiv: 2406.10165 · v1 · pith:DHHI2GTCnew · submitted 2024-06-14 · 💻 cs.CV · cs.RO

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski

classification 💻 cs.CV cs.RO

keywords driving carllava autonomous language vision better carla challenge

Abstract

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as its backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the path's advantage in lateral control with the waypoints' advantage in longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
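
The split between path and waypoints described in the abstract can be illustrated with a small sketch. This is not the CarLLaVA controller: the abstract only states that the path serves lateral control and the waypoints serve longitudinal control. The function names, the ego-frame convention (x forward, y left), the lookahead distance, and the fixed waypoint time step `dt` are all assumptions made for illustration.

```python
import math

def lateral_steer(path, lookahead=3.0):
    """Heading angle (radians) toward a lookahead point on the path.

    `path` is a list of (x, y) points in an assumed ego frame
    (x forward, y left). Path points carry geometry only, no timing,
    which is why they suit steering. Hypothetical helper.
    """
    target = path[-1]  # fall back to the last point if the path is short
    for x, y in path:
        if math.hypot(x, y) >= lookahead:
            target = (x, y)
            break
    return math.atan2(target[1], target[0])

def longitudinal_speed(waypoints, dt=0.5):
    """Target speed (m/s) from the spacing of time-indexed waypoints.

    `waypoints` are (x, y) positions at an assumed fixed interval `dt`,
    so the distance between consecutive points encodes speed.
    Hypothetical helper.
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]
    return math.hypot(x1 - x0, y1 - y0) / dt

# Straight path ahead, waypoints 1 m apart at 0.5 s intervals.
steer = lateral_steer([(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)])
speed = longitudinal_speed([(0.0, 0.0), (1.0, 0.0)], dt=0.5)
```

In this sketch the steering command is unaffected by how fast the vehicle plans to move, and the speed command is unaffected by the shape of the road, which is one plausible reading of why the two outputs are kept partly separate.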

This paper has not been read by Pith yet.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Teaching Vision-Language-Action Models What to See and Where to Look
cs.CV 2026-07 unverdicted novelty 6.0

DriveTeach-VLA adds Driving-aware Vision Distillation pretraining and 2D Trajectory-Guided Prompts to VLA models, then reports state-of-the-art results on NAVSIM and nuScenes.
VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving
cs.CV 2026-06 unverdicted novelty 6.0

VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
cs.RO 2026-01 unverdicted novelty 6.0

A cascaded end-to-end driving model conditions longitudinal planning on the lateral path via anchor-based regression and path-conditioned 1D displacement prediction, achieving SOTA driving score of 89.07 and 73.18% su...
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
cs.CV 2025-10 unverdicted novelty 6.0

DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
cs.CV 2025-08 unverdicted novelty 6.0

CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
cs.CV 2025-06 unverdicted novelty 6.0

AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
cs.CV 2025-03 unverdicted novelty 6.0

ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
cs.CV 2026-05 unverdicted novelty 5.0

LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 4.0

DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.