
arxiv: 2406.10165 · v1 · pith:DHHI2GTCnew · submitted 2024-06-14 · 💻 cs.CV · cs.RO

CarLLaVA: Vision language models for camera-only closed-loop driving

keywords: driving, carllava, autonomous, language, vision, better, carla, challenge

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the advantages of the path for better lateral control and of the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA takes 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
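The split between path and waypoints described in the abstract can be illustrated with a minimal sketch. It assumes the path is a list of spatially spaced (x, y) points and the waypoints are (x, y) points spaced at a fixed time step, both in an ego frame with x pointing forward and y to the left. The pure-pursuit steering rule, the function names, and the parameters `lookahead`, `wheelbase`, and `dt` are illustrative assumptions and are not taken from the report, which does not specify its controller here.

```python
import math


def steering_from_path(path, lookahead, wheelbase):
    """Lateral control from the path (assumed pure-pursuit rule, not the paper's).

    path: list of (x, y) points in the ego frame, x forward, y left.
    Returns a steering angle in radians; positive means steer left.
    """
    # Pick the first path point at least `lookahead` metres away,
    # falling back to the last point if the path is shorter than that.
    target = path[-1]
    for point in path:
        if math.hypot(point[0], point[1]) >= lookahead:
            target = point
            break
    x, y = target
    dist_sq = x * x + y * y
    if dist_sq == 0.0:
        return 0.0
    curvature = 2.0 * y / dist_sq  # pure-pursuit arc through the target point
    return math.atan(wheelbase * curvature)


def target_speed_from_waypoints(waypoints, dt):
    """Longitudinal control from the waypoints (hypothetical helper).

    waypoints: list of (x, y) points spaced `dt` seconds apart.
    Returns the speed in m/s implied by the first two waypoints.
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]
    return math.hypot(x1 - x0, y1 - y0) / dt


# Example: a path curving to the left, and waypoints 2 m apart at 0.5 s spacing.
path = [(1.0, 0.1), (2.0, 0.5), (3.0, 1.2)]
waypoints = [(1.0, 0.0), (3.0, 0.0)]
steer = steering_from_path(path, lookahead=2.0, wheelbase=2.9)
speed = target_speed_from_waypoints(waypoints, dt=0.5)
```

The point of the sketch is only that the two outputs carry different information: the path encodes where to drive regardless of speed, while the spacing of the time-indexed waypoints encodes how fast to drive.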



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Teaching Vision-Language-Action Models What to See and Where to Look

    cs.CV 2026-07 unverdicted novelty 6.0

    DriveTeach-VLA adds Driving-aware Vision Distillation pretraining and 2D Trajectory-Guided Prompts to VLA models, then reports state-of-the-art results on NAVSIM and nuScenes.

  2. VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

    cs.CV 2026-06 unverdicted novelty 6.0

    VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.

  3. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  4. AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving

    cs.RO 2026-01 unverdicted novelty 6.0

    A cascaded end-to-end driving model conditions longitudinal planning on the lateral path via anchor-based regression and path-conditioned 1D displacement prediction, achieving SOTA driving score of 89.07 and 73.18% su...

  5. DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

    cs.CV 2025-10 unverdicted novelty 6.0

    DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

  6. CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

    cs.CV 2025-08 unverdicted novelty 6.0

    CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.

  7. AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    cs.CV 2025-06 unverdicted novelty 6.0

    AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...

  8. ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

    cs.CV 2025-03 unverdicted novelty 6.0

    ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.

  9. LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

    cs.CV 2026-05 unverdicted novelty 5.0

    LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.

  10. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.