pith. machine review for the scientific record.

arxiv: 2402.12289 · v5 · submitted 2024-02-19 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-12 19:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous driving · vision-language models · scene understanding · hierarchical planning · hybrid system · long-tail scenarios · nuScenes · real-world deployment

The pith

Vision-language models integrated with traditional pipelines enable autonomous vehicles to handle complex urban scenarios more effectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DriveVLM as a system that applies large vision-language models to autonomous driving for better scene understanding and planning. It combines specialized reasoning modules that describe the scene, analyze it, and generate a hierarchical plan, targeting long-tail problems such as unusual road conditions and unpredictable human actions that conventional systems struggle with. To address VLMs' weaknesses in precise spatial judgment and their high compute demands, the authors add DriveVLM-Dual, a hybrid that pairs the language-model outputs with an existing autonomous driving stack. Tests on the nuScenes dataset and a custom SUP-AD set show gains in complex conditions, and the hybrid version was run successfully on a real production vehicle. If correct, this approach would allow vehicles to reason more like humans in rare situations without fully replacing proven geometric and control methods.
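
To make the chained reasoning stages concrete, here is a minimal sketch of what such a pipeline could look like in code. It is an editorial illustration, not the authors' implementation: the `query_vlm` callable, the prompt wording, and the comma-separated meta-action format are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneReasoning:
    description: str         # weather, road layout, critical objects
    analysis: str            # how each critical object may affect the ego vehicle
    meta_actions: List[str]  # coarse hierarchical plan, e.g. ["slow down", "turn left"]


def drive_step(frames: list, query_vlm: Callable[[list, str], str]) -> SceneReasoning:
    """Chain the three stages; each prompt conditions on the previous stage's output."""
    description = query_vlm(frames, "Describe the driving scene and list the critical objects.")
    analysis = query_vlm(
        frames,
        f"Scene: {description}\nExplain how each critical object could affect the ego vehicle.",
    )
    plan = query_vlm(
        frames,
        f"Scene: {description}\nAnalysis: {analysis}\n"
        "Propose a comma-separated sequence of meta-actions for the ego vehicle.",
    )
    return SceneReasoning(description, analysis, [a.strip() for a in plan.split(",") if a.strip()])
```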

Core claim

DriveVLM uses Vision-Language Models to perform scene description, scene analysis, and hierarchical planning, creating an end-to-end reasoning chain for autonomous driving. DriveVLM-Dual merges this VLM component with the standard pipeline to compensate for spatial reasoning gaps and latency. Experiments on nuScenes and SUP-AD datasets plus on-vehicle deployment confirm that the combined system manages challenging and unpredictable driving situations more reliably than either approach alone.

What carries the argument

DriveVLM, consisting of VLM-based modules for scene description, analysis, and hierarchical planning, together with the hybrid DriveVLM-Dual that fuses VLM outputs into a conventional autonomous driving pipeline.
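
One plausible reading of the "Dual" fusion, sketched below with invented names: the slow VLM branch emits a coarse waypoint proposal that the conventional geometric planner treats as a prior and refines each control tick. The paper's actual interface between the two branches may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, metres


@dataclass
class CoarsePlan:
    meta_actions: List[str]    # e.g. ["slow down", "change lane to the left"]
    waypoints: List[Waypoint]  # sparse trajectory proposed by the slow VLM branch


def dual_plan(
    coarse: Optional[CoarsePlan],
    perception: object,
    classical_planner: Callable[[object, Optional[Sequence[Waypoint]]], List[Waypoint]],
) -> List[Waypoint]:
    """Refine the VLM proposal with the geometric planner; without one, plan as usual."""
    prior = coarse.waypoints if coarse is not None else None
    return classical_planner(perception, prior)
```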

If this is right

  • Traditional autonomous driving stacks gain access to language-based reasoning for edge cases without discarding their geometric strengths.
  • Hierarchical planning generated from scene descriptions can produce more adaptable trajectories in urban environments.
  • Real-world deployment on production hardware shows the hybrid approach meets latency and safety requirements for actual roads.
  • The combination reduces reliance on hand-crafted rules for every possible rare scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future autonomous systems could use the same hybrid pattern to incorporate newer vision-language models as they improve, updating only the language component rather than retraining the entire stack.
  • If the VLM modules prove reliable for description, they might support more natural human-machine interfaces, such as explaining why the vehicle chose a particular action.
  • The approach could extend to other embodied AI tasks like robotics where spatial precision and high-level reasoning must coexist.

Load-bearing premise

Adding VLM reasoning for description and planning will meaningfully improve results on rare or complex driving cases without introducing delays or spatial mistakes that the hybrid architecture cannot correct.
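
A hedged sketch of how that correction could be enforced at runtime: gate the VLM branch behind a latency budget and a crude spatial sanity check, and fall back to the conventional planner's output otherwise. The thresholds and the plausibility test are invented for illustration.

```python
import math
import time
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, metres


def plan_with_fallback(
    vlm_branch: Callable[[], List[Waypoint]],
    classical_branch: Callable[[], List[Waypoint]],
    latency_budget_s: float = 0.3,
    max_step_m: float = 5.0,
) -> List[Waypoint]:
    """Use the VLM proposal only if it arrives in time and looks spatially plausible."""
    baseline = classical_branch()   # the conventional pipeline runs every tick regardless
    start = time.monotonic()
    try:
        proposal = vlm_branch()
    except Exception:
        return baseline             # any VLM failure degrades gracefully to the baseline
    if time.monotonic() - start > latency_budget_s:
        return baseline             # too slow for this control cycle
    # Reject proposals with physically implausible jumps between consecutive waypoints.
    if any(math.dist(a, b) > max_step_m for a, b in zip(proposal, proposal[1:])):
        return baseline
    return proposal
```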

What would settle it

A controlled comparison on the same nuScenes or SUP-AD test cases showing no reduction in collision or planning failure rates for DriveVLM-Dual versus the baseline traditional pipeline, or a real-vehicle test where the hybrid system produces unsafe maneuvers in long-tail scenes.
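
For concreteness, the open-loop half of such a comparison reduces to a few lines: mean L2 displacement against logged trajectories and a collision count over the same test cases for each system. The circular-obstacle collision test below is a simplification for illustration, not the metric definition used in the paper.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoints, metres; planned and logged horizons assumed equal


def avg_l2(planned: Sequence[Point], reference: Sequence[Point]) -> float:
    """Mean displacement between planned and logged waypoints at matched timesteps."""
    return sum(math.dist(p, r) for p, r in zip(planned, reference)) / len(reference)


def collides(planned: Sequence[Point], obstacles: Sequence[Point], radius: float = 1.5) -> bool:
    """Crude check: any planned point within `radius` metres of an obstacle centre."""
    return any(math.dist(p, o) < radius for p in planned for o in obstacles)


def summarize(runs: Sequence[Tuple[Sequence[Point], Sequence[Point], Sequence[Point]]]) -> Dict[str, float]:
    """runs: (planned, reference, obstacles) triples over the same test cases for one system."""
    l2 = sum(avg_l2(p, r) for p, r, _ in runs) / len(runs)
    crash = sum(collides(p, obs) for p, _, obs in runs) / len(runs)
    return {"avg_l2_m": l2, "collision_rate": crash}
```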

read the original abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DriveVLM, a system using Vision-Language Models for scene description, analysis, and hierarchical planning in autonomous driving to handle complex long-tail urban scenarios. It proposes DriveVLM-Dual as a hybrid that combines this with traditional pipelines to address VLM weaknesses in spatial reasoning and compute. Experiments are reported on nuScenes and the authors' SUP-AD dataset, with a final real-vehicle deployment of DriveVLM-Dual claimed to verify real-world effectiveness.

Significance. If the integration benefits can be rigorously quantified, the hybrid VLM-traditional approach could meaningfully extend autonomous driving robustness to long-tail cases while respecting safety and latency constraints. The real-vehicle deployment is a notable strength for a vision-language model paper. However, the absence of detailed metrics and controls currently limits the ability to assess whether the claimed convergence delivers net gains over existing pipelines.

major comments (3)
  1. [§4] §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.
  2. [§4.2] §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.
  3. [§5] §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.
minor comments (2)
  1. [Abstract, §3] Abstract and §3: The SUP-AD dataset is referenced without a description of its size, collection protocol, or annotation process; this should be added for reproducibility.
  2. [§4] Figure captions and §4: Several figures appear to show qualitative examples; quantitative error bars or statistical significance tests on any reported trends would strengthen the presentation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional quantitative evidence and controls would strengthen the claims regarding the benefits of integrating VLMs into autonomous driving pipelines. We have revised the manuscript to incorporate these elements.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.

    Authors: We agree that the original presentation relied heavily on qualitative case studies. In the revised manuscript we have added a new quantitative evaluation section that reports planning success rates, collision rates, average latency, and performance on long-tail scenario subsets for both DriveVLM and DriveVLM-Dual. Direct comparisons against the traditional autonomous driving pipeline and other VLM-based planners are now included on both the nuScenes and SUP-AD datasets. revision: yes

  2. Referee: [§4.2] §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.

    Authors: This observation is accurate. We have added ablation experiments in the revised version that systematically disable the VLM scene-description and hierarchical-planning modules while keeping the traditional perception and control stack unchanged. The results quantify the incremental improvement in robustness for complex urban scenarios attributable to the VLM components. revision: yes

  3. Referee: [§5] §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.

    Authors: We acknowledge the limitation in the original deployment description. The revised manuscript now reports quantitative metrics collected during the production-vehicle tests, including measured latency, observed spatial-reasoning error rates, and explicit mitigation strategies (e.g., fallback logic to the traditional pipeline). Where feasible, side-by-side comparisons with the non-VLM baseline under matched conditions are also provided. revision: yes
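
The ablation grid promised in the second response could be expressed as a simple configuration sweep: toggle the VLM scene-description, analysis, and planning modules while the classical perception and control stack stays fixed. The field names below are hypothetical; the paper does not expose such a config.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DualConfig:
    use_vlm_description: bool
    use_vlm_analysis: bool
    use_vlm_planning: bool
    # The traditional perception / prediction / control stack is always enabled.


def ablation_grid():
    """Every on/off combination of the VLM modules, classical stack held constant."""
    return [DualConfig(*flags) for flags in product([True, False], repeat=3)]


for cfg in ablation_grid():
    print(cfg)  # each config would be scored on the same nuScenes / SUP-AD splits
```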

Circularity Check

0 steps flagged

No circularity: empirical claims rest on dataset evaluations and deployment, not self-referential definitions or fitted inputs

full rationale

The paper presents DriveVLM and DriveVLM-Dual as architectural integrations of VLMs with planning modules, evaluated via experiments on nuScenes and SUP-AD plus real-vehicle deployment. No derivation chain, equations, or first-principles results are claimed; performance assertions are supported by external benchmarks rather than reducing to fitted parameters renamed as predictions or self-citations that bear the central load. The hybrid design is motivated by stated VLM limitations (spatial reasoning, compute) without smuggling ansatzes or uniqueness theorems from prior self-work. This is a standard empirical systems paper whose claims can be falsified by the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that pre-trained VLMs can be prompted or fine-tuned for driving-specific spatial and planning tasks, plus the unstated premise that the hybrid architecture preserves real-time performance.

axioms (1)
  • domain assumption: VLMs possess sufficient spatial reasoning when augmented with traditional pipelines
    Invoked implicitly when proposing DriveVLM-Dual to address VLM limitations.
invented entities (2)
  • DriveVLM · no independent evidence
    purpose: VLM-based autonomous driving system with scene description, analysis, and hierarchical planning modules
    New named system introduced in the paper.
  • DriveVLM-Dual · no independent evidence
    purpose: Hybrid system combining DriveVLM with traditional autonomous driving pipeline
    New named hybrid architecture introduced in the paper.

pith-pipeline@v0.9.0 · 5476 in / 1306 out tokens · 22872 ms · 2026-05-12T19:18:02.567147+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Membership Inference Attacks on Vision-Language-Action Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

  2. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  3. V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

    cs.RO 2026-04 conditional novelty 8.0

    V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...

  4. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  5. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 7.0

    VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.

  6. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.

  7. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.

  8. Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and ele...

  9. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  10. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  11. RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

    cs.CV 2026-03 unverdicted novelty 7.0

    RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

  12. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  13. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  14. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  15. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  16. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  17. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  18. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  19. Quantifying the human visual exposome with vision language models

    cs.AI 2026-05 unverdicted novelty 6.0

    Vision language models applied to daily-life photos quantify visual environmental features that correlate with momentary affect and chronic stress, establishing a paradigm for visual exposomics.

  20. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  21. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.

  22. VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions

    eess.SY 2026-04 unverdicted novelty 6.0

    VLM-VPI uses Qwen3-VL and GPT-OSS models for pedestrian intent and age reasoning plus a tiered safety controller, reporting 92.3% intent accuracy in CARLA and reduced conflicts versus rule-based and supervised baselines.

  23. EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.

  24. If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    cs.CV 2026-04 unverdicted novelty 6.0

    LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

  25. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  26. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  27. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  28. ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

    cs.CL 2026-04 unverdicted novelty 6.0

    ICR-Drive reveals substantial performance drops in end-to-end language-driven driving models when instructions are paraphrased, made ambiguous, noised, or misleading.

  29. AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    cs.CV 2025-06 unverdicted novelty 6.0

    AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...

  30. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  31. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  32. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.

  33. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  34. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  35. Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

    cs.RO 2026-04 accept novelty 4.0

    A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

  36. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

  37. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 33 Pith papers · 13 internal anchors

  1. [1]

    I. Barabas, A. Todoruț, N. Cordoș, and A. Molea. Current challenges in autonomous driving. In IOP conference series: materials science and engineering, volume 252, page 012096. IOP Publishing, 2017

  2. [2]

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

  3. [3]

    A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019

  4. [4]

    Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022

  5. [5]

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022

  6. [6]

    J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  7. [7]

    H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid, C. Li, and D. Anguelov. Tnt: Target-driven trajectory prediction. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 895–904. PMLR, 16–18 Nov 2021

  8. [8]

    Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7577–7586, 2021

  9. [9]

    J. Gu, C. Sun, and H. Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15303–15312, 2021

  10. [10]

    N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2980–2987. IEEE, 2023

  11. [11]

    D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988

  12. [12]

    ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

    M. Bansal, A. Krizhevsky, and A. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018

  13. [13]

    Z. Li, F. Nie, Q. Sun, F. Da, and H. Zhao. Uncertainty-aware decision transformer for stochastic driving environments. arXiv preprint arXiv:2309.16397, 2023

  14. [14]

    Y. Zeng, H. Zhang, J. Zheng, J. Xia, G. Wei, Y. Wei, Y. Zhang, and T. Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023

  15. [15]

    Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K. K. Wong, Z. Li, and H. Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023

  16. [16]

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  17. [17]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023

  18. [18]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  19. [19]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  20. [20]

    P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, W. Zhang, H. Yan, X. Zhang, W. Li, J. Li, K. Chen, C. He, X. Zhang, Y. Qiao, D. Lin, and J. Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition, 2023

  21. [21]

    W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

  22. [22]

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023

  23. [23]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  24. [24]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

  25. [25]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  26. [26]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x mod- els. arXiv preprint arXiv:2310.08864, 2023

  27. [27]

    R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 12(5):127, 2023

  28. [28]

    D. Chen, V. Koltun, and P. Krähenbühl. Learning to drive from a world on rails. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15590–15599, 2021

  29. [29]

    M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7153–7162, 2020

  30. [30]

    W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019

  31. [31]

    B. Wei, M. Ren, W. Zeng, M. Liang, B. Yang, and R. Urtasun. Perceive, attend, and drive: Learning spatial attention for safe self-driving. In 2021 IEEE International Conference on Robotics and Automation (ICRA) , pages 4875–4881. IEEE, 2021

  32. [32]

    P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan. Safe local motion planning with self-supervised freespace forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12732–12741, 2021

  33. [33]

    S. Casas, A. Sadat, and R. Urtasun. Mp3: A unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14403–14412, 2021

  34. [34]

    Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023

  35. [35]

    K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger. Plant: Explainable planning transformers via object-level representations. arXiv preprint arXiv:2210.14222, 2022

  36. [36]

    L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023

  37. [37]

    A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  38. [38]

    Z. Yang, X. Jia, H. Li, and J. Yan. A survey of large language models for autonomous driving. arXiv preprint arXiv:2311.01043, 2023

  39. [39]

    D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, and J. Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14633–14642, 2023

  40. [40]

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  41. [41]

    Talk2Car: Taking Control of Your Self-Driving Car

    T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M.-F. Moens. Talk2car: Taking control of your self-driving car. arXiv preprint arXiv:1909.10838, 2019

  42. [42]

    D. Wu, W. Han, T. Wang, Y. Liu, X. Zhang, and J. Shen. Language prompt for autonomous driving. arXiv preprint arXiv:2309.04379, 2023

  43. [43]

    T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023

  44. [44]

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  45. [45]

    J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV), pages 563–578, 2018

  46. [46]

    Y. Xu, X. Yang, L. Gong, H.-C. Lin, T.-Y. Wu, Y. Li, and N. Vasconcelos. Explainable object-induced action decision for autonomous vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9523–9532, 2020

  47. [47]

    E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, B. Dariush, C. Choi, and M. Kochenderfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023

  48. [48]

    S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1043–1052, 2023

  49. [49]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  51. [51]

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023

  52. [52]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  53. [53]

    T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In European Conference on Computer Vision, pages 353–369. Springer, 2022

  54. [54]

    S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision , pages 533–549. Springer, 2022

  55. [55]

    L. Xu, H. Huang, and J. Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9878–9888, 2021

  56. [56]

    J.-T. Zhai, Z. Feng, J. Du, Y. Mao, J.-J. Liu, Z. Tan, Y. Zhang, X. Ye, and J. Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023

  57. [57]

    X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  58. [58]

    B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  59. [59]

    L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016

  60. [60]

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023

  61. [61]

    Grok-1.5 vision preview

    X.ai. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024

  62. [62]

    X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

  63. [63]

    Q. Team. Introducing qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/

  64. [64]

    G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  65. [65]

    S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  66. [66]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  67. [67]

    H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  68. [68]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

  69. [69]

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  70. [70]

    B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024

  71. [71]

    J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7132–7141, 2018

  72. [72]

    Y. Li, F. Wei, C. Zhang, and H. Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024

  73. [73]

    T. Cai, Y. Li, Z. Geng, H. Peng, and T. Dao. Medusa: Simple framework for accelerating llm generation with multiple decoding heads, 2023


Showing first 73 references.