arxiv: 2402.12289 · v5 · submitted 2024-02-19 · 💻 cs.CV

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

Pith reviewed 2026-05-12 19:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords autonomous driving · vision-language models · scene understanding · hierarchical planning · hybrid system · long-tail scenarios · nuScenes · real-world deployment

The pith

Vision-language models integrated with traditional pipelines enable autonomous vehicles to handle complex urban scenarios more effectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DriveVLM, a system that applies large vision-language models (VLMs) to autonomous driving for better scene understanding and planning. It chains specialized reasoning modules that describe the scene, analyze it, and generate hierarchical plans, targeting long-tail problems such as unusual road conditions and unpredictable human behavior that conventional systems struggle with. To address VLMs' weak spatial reasoning and high compute demands, the authors add DriveVLM-Dual, a hybrid that pairs the VLM's outputs with an existing autonomous driving stack. Tests on the nuScenes dataset and the authors' own SUP-AD dataset are reported to show gains in complex conditions, and the hybrid version was deployed on a production vehicle. If correct, this approach would let vehicles reason more like humans in rare situations without replacing proven geometric and control methods.

Core claim

DriveVLM uses Vision-Language Models to perform scene description, scene analysis, and hierarchical planning, creating an end-to-end reasoning chain for autonomous driving. DriveVLM-Dual merges this VLM component with the standard pipeline to compensate for spatial reasoning gaps and latency. Experiments on nuScenes and SUP-AD datasets plus on-vehicle deployment confirm that the combined system manages challenging and unpredictable driving situations more reliably than either approach alone.

What carries the argument

DriveVLM, consisting of VLM-based modules for scene description, analysis, and hierarchical planning, together with the hybrid DriveVLM-Dual that fuses VLM outputs into a conventional autonomous driving pipeline.
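
The three-stage chain described above can be pictured as a pipeline in which each stage consumes the previous stage's output. The following is only a minimal sketch under stated assumptions: the type names, the field names (`critical_objects`, `meta_actions`, `waypoints`), and the rule-based stub stages are hypothetical stand-ins for VLM calls, not the authors' implementation or prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres, ego frame (assumed convention)


@dataclass
class SceneDescription:
    environment: str             # e.g. weather and road type
    critical_objects: List[str]  # objects judged relevant to the ego vehicle


@dataclass
class SceneAnalysis:
    notes: List[str]             # one free-text judgement per critical object


@dataclass
class Plan:
    meta_actions: List[str]      # coarse decisions, e.g. "slow_down"
    waypoints: List[Waypoint]    # coarse trajectory for a downstream refiner


def run_reasoning_chain(
    frame: dict,
    describe: Callable[[dict], SceneDescription],
    analyze: Callable[[SceneDescription], SceneAnalysis],
    plan: Callable[[SceneDescription, SceneAnalysis], Plan],
) -> Plan:
    """Run description -> analysis -> planning, each stage feeding the next."""
    description = describe(frame)
    analysis = analyze(description)
    return plan(description, analysis)


# Hypothetical rule-based stubs standing in for the VLM-backed stages.
def describe_stub(frame: dict) -> SceneDescription:
    return SceneDescription(frame["environment"], list(frame["objects"]))


def analyze_stub(desc: SceneDescription) -> SceneAnalysis:
    return SceneAnalysis([f"{obj}: may affect ego path" for obj in desc.critical_objects])


def plan_stub(desc: SceneDescription, analysis: SceneAnalysis) -> Plan:
    if analysis.notes:  # something critical was flagged, so plan conservatively
        return Plan(["slow_down", "keep_lane"], [(0.0, 2.0), (0.0, 3.5), (0.0, 4.5)])
    return Plan(["keep_speed"], [(0.0, 3.0), (0.0, 6.0), (0.0, 9.0)])


example_plan = run_reasoning_chain(
    {"environment": "rainy urban street", "objects": ["fallen tree"]},
    describe_stub, analyze_stub, plan_stub,
)
```

Passing the stages in as callables keeps the chain itself independent of any particular model, which is the property the hybrid design relies on when only the language component is swapped.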

If this is right

Traditional autonomous driving stacks gain access to language-based reasoning for edge cases without discarding their geometric strengths.
Hierarchical planning generated from scene descriptions can produce more adaptable trajectories in urban environments.
Real-world deployment on production hardware shows the hybrid approach meets latency and safety requirements for actual roads.
The combination reduces reliance on hand-crafted rules for every possible rare scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future autonomous systems could use the same hybrid pattern to incorporate newer vision-language models as they improve, updating only the language component rather than retraining the entire stack.
If the VLM modules prove reliable for description, they might support more natural human-machine interfaces, such as explaining why the vehicle chose a particular action.
The approach could extend to other embodied AI tasks like robotics where spatial precision and high-level reasoning must coexist.

Load-bearing premise

Adding VLM reasoning for description and planning will meaningfully improve results on rare or complex driving cases without introducing delays or spatial mistakes that the hybrid architecture cannot correct.

What would settle it

A controlled comparison on the same nuScenes or SUP-AD test cases showing no reduction in collision or planning failure rates for DriveVLM-Dual versus the baseline traditional pipeline, or a real-vehicle test where the hybrid system produces unsafe maneuvers in long-tail scenes.
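
One way such a controlled comparison could be scored is a paired test on per-scenario failure flags, since both systems run on the same test cases. The sketch below uses an exact McNemar test, which is a choice made here for illustration and is not taken from the paper; the function name and the toy outcome lists are hypothetical and are not results from nuScenes or SUP-AD.

```python
from math import comb
from typing import Dict, Sequence


def mcnemar_exact(baseline_fail: Sequence[bool], hybrid_fail: Sequence[bool]) -> Dict[str, float]:
    """Exact two-sided McNemar test on paired per-scenario failure flags."""
    if len(baseline_fail) != len(hybrid_fail):
        raise ValueError("both systems must be scored on the same scenarios")
    pairs = list(zip(baseline_fail, hybrid_fail))
    # Discordant pairs: scenarios where exactly one of the two systems failed.
    only_baseline = sum(1 for b, h in pairs if b and not h)
    only_hybrid = sum(1 for b, h in pairs if h and not b)
    n = only_baseline + only_hybrid
    if n == 0:
        p_value = 1.0  # the systems never disagree, so there is no evidence either way
    else:
        k = min(only_baseline, only_hybrid)
        # Binomial tail under H0: discordant pairs split 50/50 between systems.
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2.0 * tail)
    return {
        "only_baseline_failed": only_baseline,
        "only_hybrid_failed": only_hybrid,
        "p_value": p_value,
    }


# Toy data (made up): 20 scenarios, baseline fails alone on 8, hybrid alone on 1.
baseline_fail = [True] * 8 + [False] * 1 + [True] * 2 + [False] * 9
hybrid_fail = [False] * 8 + [True] * 1 + [True] * 2 + [False] * 9
comparison = mcnemar_exact(baseline_fail, hybrid_fail)
```

A large p-value on the long-tail subset would be the "no reduction" outcome described above; scenarios where both systems fail or both succeed do not influence the test.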

Original abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DriveVLM-Dual is a practical hybrid of VLM reasoning modules and the traditional AD stack with real-vehicle deployment, but the experiments do not isolate what the VLM components actually contribute to long-tail performance. The paper introduces three specific reasoning modules for scene description, scene analysis, and hierarchical planning, plus the SUP-AD dataset and an actual production-vehicle test. Those are concrete additions. It also correctly flags the spatial-reasoning and latency weaknesses of current VLMs and keeps the classical pipeline in the loop to compensate, which is a straightforward engineering move that avoids over-reliance on the new model. That hybrid framing is the part worth noting for anyone building systems rather than chasing pure end-to-end learning.

The evaluation uses both nuScenes and the new SUP-AD set and reports that the system handles complex urban conditions, with the final deployment serving as a reality check. The citation pattern is standard and does not show circular reasoning. The main limitation is that the efficacy claims rest on overall positive outcomes without ablations that disable the VLM path while holding the rest fixed, or metrics broken out by long-tail subset. Without those controls it remains unclear whether any robustness gains come from the added modules or simply from the retained traditional components. The real-vehicle result is useful but would be stronger with details on test conditions and failure modes.

This work is aimed at researchers who integrate foundation models into autonomous driving pipelines. A reader looking for system architectures and deployment stories will find usable ideas even if the quantitative case needs tightening. It deserves peer review because the hybrid design and vehicle test give it enough substance for referees to engage with the experimental gaps and suggest concrete fixes.

Referee Report

3 major / 2 minor

Summary. The paper introduces DriveVLM, a system using Vision-Language Models for scene description, analysis, and hierarchical planning in autonomous driving to handle complex long-tail urban scenarios. It proposes DriveVLM-Dual as a hybrid that combines this with traditional pipelines to address VLM weaknesses in spatial reasoning and compute. Experiments are reported on nuScenes and the authors' SUP-AD dataset, with a final real-vehicle deployment of DriveVLM-Dual claimed to verify real-world effectiveness.

Significance. If the integration benefits can be rigorously quantified, the hybrid VLM-traditional approach could meaningfully extend autonomous driving robustness to long-tail cases while respecting safety and latency constraints. The real-vehicle deployment is a notable strength for a vision-language model paper. However, the absence of detailed metrics and controls currently limits the ability to assess whether the claimed convergence delivers net gains over existing pipelines.

major comments (3)

[§4] §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.
[§4.2] §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.
[§5] §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.
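
The ablation and subset breakdown requested in the major comments could be tabulated along the following lines. This is a minimal sketch, assuming each evaluation episode is logged with a system variant, a scenario subset, a collision flag, and a mean L2 trajectory error; the record layout, the labels `full` and `no_vlm`, and the toy numbers are all hypothetical and are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class EpisodeResult:
    variant: str       # "full" (VLM path on) or "no_vlm" (VLM path disabled)
    subset: str        # e.g. "long_tail" or "common"
    collided: bool
    l2_error_m: float  # mean displacement from the reference trajectory, in metres


def summarize(results: Iterable[EpisodeResult]) -> Dict[Tuple[str, str], dict]:
    """Aggregate collision rate and mean L2 error per (variant, subset) cell."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r.variant, r.subset)].append(r)
    summary = {}
    for key, rows in buckets.items():
        summary[key] = {
            "n": len(rows),
            "collision_rate": sum(r.collided for r in rows) / len(rows),
            "mean_l2_m": sum(r.l2_error_m for r in rows) / len(rows),
        }
    return summary


# Toy log (made up) in which only the VLM path differs between variants.
toy_results = [
    EpisodeResult("full", "long_tail", False, 0.5),
    EpisodeResult("full", "long_tail", False, 1.5),
    EpisodeResult("no_vlm", "long_tail", True, 1.0),
    EpisodeResult("no_vlm", "long_tail", False, 2.0),
    EpisodeResult("full", "common", False, 0.5),
    EpisodeResult("no_vlm", "common", False, 0.5),
]
ablation_table = summarize(toy_results)
```

Comparing the `full` and `no_vlm` rows within the same subset is what attributes a gain to the VLM components; comparing across subsets shows whether the gain is concentrated in long-tail cases.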

minor comments (2)

[Abstract, §3] Abstract and §3: The SUP-AD dataset is referenced without a description of its size, collection protocol, or annotation process; this should be added for reproducibility.
[§4] Figure captions and §4: Several figures appear to show qualitative examples; quantitative error bars or statistical significance tests on any reported trends would strengthen the presentation.
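
For the error bars mentioned in the second minor comment, a percentile bootstrap over scenarios is one simple option. The sketch below is an illustration only: the function name, the resample count, and the toy outcome list are assumptions made here, and it treats scenarios as independent, which may not hold for clips cut from the same drive.

```python
import random
from typing import Sequence, Tuple


def bootstrap_rate_ci(
    outcomes: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for a failure rate.

    Uses a percentile bootstrap: resample scenarios with replacement and
    read the interval off the sorted resampled rates.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Toy data (made up): 5 collisions in 50 scenarios.
toy_outcomes = [True] * 5 + [False] * 45
point, lower, upper = bootstrap_rate_ci(toy_outcomes)
```

With only tens of long-tail scenarios the interval will be wide, which is itself useful information when judging whether a reported trend is meaningful.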

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional quantitative evidence and controls would strengthen the claims regarding the benefits of integrating VLMs into autonomous driving pipelines. We have revised the manuscript to incorporate these elements.

Referee: [§4] §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.

Authors: We agree that the original presentation relied heavily on qualitative case studies. In the revised manuscript we have added a new quantitative evaluation section that reports planning success rates, collision rates, average latency, and performance on long-tail scenario subsets for both DriveVLM and DriveVLM-Dual. Direct comparisons against the traditional autonomous driving pipeline and other VLM-based planners are now included on both the nuScenes and SUP-AD datasets.
revision: yes

Referee: [§4.2] §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.

Authors: This observation is accurate. We have added ablation experiments in the revised version that systematically disable the VLM scene-description and hierarchical-planning modules while keeping the traditional perception and control stack unchanged. The results quantify the incremental improvement in robustness for complex urban scenarios attributable to the VLM components.
revision: yes

Referee: [§5] §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.

Authors: We acknowledge the limitation in the original deployment description. The revised manuscript now reports quantitative metrics collected during the production-vehicle tests, including measured latency, observed spatial-reasoning error rates, and explicit mitigation strategies (e.g., fallback logic to the traditional pipeline). Where feasible, side-by-side comparisons with the non-VLM baseline under matched conditions are also provided.
revision: yes
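
The "fallback logic to the traditional pipeline" mentioned in the last response is not specified anywhere on this page. As an illustration of what such a mitigation might look like, the sketch below accepts a VLM trajectory proposal only if it is recent enough and stays close to the classical planner's trajectory, and otherwise keeps the classical output. The thresholds, type names, and reason strings are hypothetical placeholders, not values or interfaces from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres, ego frame (assumed convention)


@dataclass
class VlmProposal:
    waypoints: List[Waypoint]
    age_s: float  # seconds since the camera frame the VLM reasoned about


def max_deviation(a: List[Waypoint], b: List[Waypoint]) -> float:
    """Largest point-wise Euclidean distance between two equal-length paths."""
    return max(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for (ax, ay), (bx, by) in zip(a, b))


def select_trajectory(
    classical: List[Waypoint],
    proposal: Optional[VlmProposal],
    max_age_s: float = 0.5,  # placeholder latency budget
    max_dev_m: float = 2.0,  # placeholder spatial sanity bound
) -> Tuple[List[Waypoint], str]:
    """Return the trajectory to execute and a reason string for logging."""
    if proposal is None:
        return classical, "classical:no_proposal"
    if proposal.age_s > max_age_s:
        return classical, "classical:stale"
    if not classical or len(proposal.waypoints) != len(classical):
        return classical, "classical:shape_mismatch"
    if max_deviation(proposal.waypoints, classical) > max_dev_m:
        return classical, "classical:deviation"
    return proposal.waypoints, "vlm"


classical_path = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
nudged_path = [(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)]
far_path = [(5.0, 1.0), (5.0, 2.0), (5.0, 3.0)]

_, fresh_reason = select_trajectory(classical_path, VlmProposal(nudged_path, age_s=0.2))
_, stale_reason = select_trajectory(classical_path, VlmProposal(nudged_path, age_s=1.5))
_, far_reason = select_trajectory(classical_path, VlmProposal(far_path, age_s=0.2))
```

Logging the reason string per cycle would also yield the kind of deployment metrics the referee asks for, such as how often the VLM path is rejected for latency versus spatial disagreement.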

Circularity Check

0 steps flagged

No circularity: empirical claims rest on dataset evaluations and deployment, not self-referential definitions or fitted inputs

The paper presents DriveVLM and DriveVLM-Dual as architectural integrations of VLMs with planning modules, evaluated via experiments on nuScenes and SUP-AD plus real-vehicle deployment. No derivation chain, equations, or first-principles results are claimed; performance assertions are supported by external benchmarks rather than reducing to fitted parameters renamed as predictions or self-citations that bear the central load. The hybrid design is motivated by stated VLM limitations (spatial reasoning, compute) without smuggling ansatzes or uniqueness theorems from prior self-work. This is a standard empirical systems paper whose claims can be falsified by the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that pre-trained VLMs can be prompted or fine-tuned for driving-specific spatial and planning tasks, plus the unstated premise that the hybrid architecture preserves real-time performance.

axioms (1)

domain assumption: VLMs possess sufficient spatial reasoning when augmented with traditional pipelines
Invoked implicitly when proposing DriveVLM-Dual to address VLM limitations.

invented entities (2)

DriveVLM (no independent evidence)
purpose: VLM-based autonomous driving system with scene description, analysis, and hierarchical planning modules
New named system introduced in the paper.

DriveVLM-Dual (no independent evidence)
purpose: Hybrid system combining DriveVLM with traditional autonomous driving pipeline
New named hybrid architecture introduced in the paper.

pith-pipeline@v0.9.0 · 5476 in / 1306 out tokens · 22872 ms · 2026-05-12T19:18:02.567147+00:00 · methodology

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Membership Inference Attacks on Vision-Language-Action Models
cs.CR 2026-05 unverdicted novelty 8.0

Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
cs.CV 2026-04 unverdicted novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
cs.RO 2026-04 conditional novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 7.0

VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
Hyperbolic Concept Bottleneck Models
cs.LG 2026-05 unverdicted novelty 7.0

HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
Hyperbolic Concept Bottleneck Models
cs.LG 2026-05 unverdicted novelty 7.0

Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.
Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning
cs.RO 2026-04 unverdicted novelty 7.0

A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and ele...
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
cs.CV 2026-04 unverdicted novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
cs.CV 2026-03 unverdicted novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
cs.CV 2025-06 unverdicted novelty 7.0

ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
cs.CV 2026-05 unverdicted novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
Quantifying the human visual exposome with vision language models
cs.AI 2026-05 unverdicted novelty 6.0

Vision language models applied to daily-life photos quantify visual environmental features that correlate with momentary affect and chronic stress, establishing a paradigm for visual exposomics.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
cs.AI 2026-05 unverdicted novelty 6.0

MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
eess.SY 2026-04 unverdicted novelty 6.0

VLM-VPI uses Qwen3-VL and GPT-OSS models for pedestrian intent and age reasoning plus a tiered safety controller, reporting 92.3% intent accuracy in CARLA and reduced conflicts versus rule-based and supervised baselines.
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 6.0

EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
cs.CV 2026-04 unverdicted novelty 6.0

LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
cs.CV 2026-04 unverdicted novelty 6.0

OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
cs.CV 2026-04 unverdicted novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
cs.CV 2026-04 unverdicted novelty 6.0

Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving
cs.CL 2026-04 unverdicted novelty 6.0

ICR-Drive reveals substantial performance drops in end-to-end language-driven driving models when instructions are paraphrased, made ambiguous, noised, or misleading.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
cs.CV 2025-06 unverdicted novelty 6.0

AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
EponaV2: Driving World Model with Comprehensive Future Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
cs.AI 2026-05 unverdicted novelty 5.0

Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 5.0

VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 4.0

DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
cs.RO 2026-04 accept novelty 4.0

A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV 2026-04 unverdicted novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 32 Pith papers · 13 internal anchors

[1]

Barabas, A

I. Barabas, A. Todorut ¸, N. Cordos ¸, and A. Molea. Current challenges in autonomous driving. In IOP conference series: materials science and engineering , volume 252, page 012096. IOP Publishing, 2017

work page 2017
[2]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

work page 2017
[3]

A. H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. Pointpillars: Fast en- coders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 12697–12705, 2019

work page 2019
[4]

Y . Wang, V . C. Guizilini, T. Zhang, Y . Wang, H. Zhao, and J. Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning , pages 180–191. PMLR, 2022

work page 2022
[5]

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022

work page 2022
[6]

J. Gao, C. Sun, H. Zhao, Y . Shen, D. Anguelov, C. Li, and C. Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020

work page 2020
[7]

H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y . Shen, Y . Shen, Y . Chai, C. Schmid, C. Li, and D. Anguelov. Tnt: Target-driven trajectory prediction. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning , volume 155 of Proceedings of Machine Learning Research, pages 895–904. PMLR, 16–18 Nov 2021

work page 2020
[8]

Y . Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7577–7586, 2021

work page 2021
[9]

J. Gu, C. Sun, and H. Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 15303– 15312, 2021

work page 2021
[10]

Nayakanti, R

N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 2980–2987. IEEE, 2023. 12

work page 2023
[11]

D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988

work page 1988
[12]

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

M. Bansal, A. Krizhevsky, and A. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018

work page Pith review arXiv 2018
[13]

Z. Li, F. Nie, Q. Sun, F. Da, and H. Zhao. Uncertainty-aware decision transformer for stochastic driving environments. arXiv preprint arXiv:2309.16397, 2023

work page arXiv 2023
[14]

Y . Zeng, H. Zhang, J. Zheng, J. Xia, G. Wei, Y . Wei, Y . Zhang, and T. Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023

work page arXiv 2023
[15]

Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K. K. Wong, Z. Li, and H. Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023

work page arXiv 2023
[16]

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023

work page 2023
[18]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023
[20]

Zhang, X

P. Zhang, X. Dong, B. Wang, Y . Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, W. Zhang, H. Yan, X. Zhang, W. Li, J. Li, K. Chen, C. He, X. Zhang, Y . Qiao, D. Lin, and J. Wang. Internlm-xcomposer: A vision-language large model for advanced text-image com- prehension and composition, 2023

work page 2023
[21]

W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y . Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 , 2023

work page arXiv 2023
[22]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023

[23]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

[24]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

[25]

W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

[26]

A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

[27]

R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 12(5):127, 2023

[28]

D. Chen, V. Koltun, and P. Krähenbühl. Learning to drive from a world on rails. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15590–15599, 2021

[29]

M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7153–7162, 2020

[30]

W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019

[31]

B. Wei, M. Ren, W. Zeng, M. Liang, B. Yang, and R. Urtasun. Perceive, attend, and drive: Learning spatial attention for safe self-driving. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4875–4881. IEEE, 2021

[32]

P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan. Safe local motion planning with self-supervised freespace forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12732–12741, 2021

[33]

S. Casas, A. Sadat, and R. Urtasun. Mp3: A unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14403–14412, 2021

[34]

Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023

[35]

K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger. Plant: Explainable planning transformers via object-level representations. arXiv preprint arXiv:2210.14222, 2022

[36]

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023

[37]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

[38]

Z. Yang, X. Jia, H. Li, and J. Yan. A survey of large language models for autonomous driving. arXiv preprint arXiv:2311.01043, 2023

[39]

D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, and J. Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14633–14642, 2023

[40]

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

[41]

T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M.-F. Moens. Talk2car: Taking control of your self-driving car. arXiv preprint arXiv:1909.10838, 2019

[42]

D. Wu, W. Han, T. Wang, Y. Liu, X. Zhang, and J. Shen. Language prompt for autonomous driving. arXiv preprint arXiv:2309.04379, 2023

[43]

T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023

[44]

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

[45]

J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV), pages 563–578, 2018

[46]

Y. Xu, X. Yang, L. Gong, H.-C. Lin, T.-Y. Wu, Y. Li, and N. Vasconcelos. Explainable object-induced action decision for autonomous vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9523–9532, 2020

[47]

E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, B. Dariush, C. Choi, and M. Kochenderfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023

[48]

S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1043–1052, 2023

[49]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[50]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

[51]

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023

[52]

OpenAI. Gpt-4 technical report, 2023

[53]

T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In European Conference on Computer Vision, pages 353–369. Springer, 2022

[54]

S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022

[55]

L. Xu, H. Huang, and J. Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9878–9888, 2021

[56]

J.-T. Zhai, Z. Feng, J. Du, Y. Mao, J.-J. Liu, Z. Tan, Y. Zhang, X. Ye, and J. Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023

[57]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

[58]

B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

[59]

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016

[60]

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023

[61]

X.ai. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024

[62]

X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

[63]

Q. Team. Introducing qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/

[64]

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

[65]

S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

[66]

M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

[67]

H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

[68]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

[69]

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

[70]

B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024

[71]

J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

[72]

Y. Li, F. Wei, C. Zhang, and H. Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024

[73]

T. Cai, Y. Li, Z. Geng, H. Peng, and T. Dao. Medusa: Simple framework for accelerating llm generation with multiple decoding heads, 2023

A SUP-AD Dataset

A.1 Meta-actions

Meta-action statistics. We use the meta-action sequence to formally represent the driving strategy. Meta-actions are classified into 17 categories. We show the distribution of each ...


Speed-control actions. Discerned from acceleration and braking signals within the ego state data, these actions include speed up, slow down, slow down rapidly, go straight slowly, go straight at a constant speed, stop, wait, and reverse.


Turning actions. Deduced from steering wheel signals, these actions consist of turn left, turn right, and turn around


Lane-control actions. Encompassing lane selection decisions, these actions are derived from a combination of steering wheel signals and either map or perception data. They involve change lane to the left, change lane to the right, shift slightly to the left, and shift slightly to the right.

A.2 Scenario Categories

The SUP-AD dataset is comprised of 1,...


Weather: Sunny (environmental conditions)


Time: Day (environmental conditions)


Road Environment: Urban (environmental conditions)


Lane Options: Left Lane, Own Lane, Right Lane (environmental conditions)

