DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
citing papers explorer
-
Membership Inference Attacks on Vision-Language-Action Models
Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.
-
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
-
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.
-
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
-
Grounding Driving VLA via Inverse Kinematics
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
-
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Proposes the Spatial Narrative Score (SNS) for evaluating VLMs' camera-motion understanding and introduces the CaMo model, which achieves consistent performance on SNS and direct QA.
-
Hyperbolic Concept Bottleneck Models
HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
-
Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning
A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
-
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
MAPLE proposes latent multi-agent rollouts with supervised fine-tuning followed by reinforcement learning using safety, progress, interaction, and diversity rewards to enable scalable closed-loop training for end-to-end autonomous driving.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-Drive uses shared self-attention with semantic-aware routing of tokens to VL and trajectory experts, plus flow-matching action decoding, to reach a Driving Score of 88.91 on Bench2Drive.
-
Quantifying the human visual exposome with vision language models
Vision language models applied to daily-life photos quantify visual environmental features that correlate with momentary affect and chronic stress, establishing a paradigm for visual exposomics.
-
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
VLM-VPI uses Qwen3-VL and GPT-OSS models for pedestrian intent and age reasoning plus a tiered safety controller, reporting 92.3% intent accuracy in CARLA and reduced conflicts versus rule-based and supervised baselines.
-
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
-
If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
LVLM-based agents exhibit trust boundary confusion under visual injections; a multi-agent defense that separates perception from decision-making reduces misleading responses while preserving correct ones.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
-
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
-
ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving
ICR-Drive reveals substantial performance drops in end-to-end language-driven driving models when instructions are paraphrased, made ambiguous, noised, or misleading.
-
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.
-
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.
-
VERDI: VLM-Embedded Reasoning for Autonomous Driving
VERDI aligns perception, prediction, and planning outputs of end-to-end AD models with VLM-generated text features at training time to embed structured reasoning, yielding up to 11% better L2 distance and a 10% higher non-collision rate in closed-loop tests.
-
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.
-
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Senna decouples high-level, language-based planning by an LVLM from low-level trajectory prediction by an E2E model, reporting 27% lower planning error and a 33% lower collision rate after pre-training on DriveX and fine-tuning on nuScenes.
-
Enhancing End-to-End Autonomous Driving with Latent World Model
LAW introduces a self-supervised prediction task on latent scene features that boosts end-to-end driving performance on nuScenes, NAVSIM, and CARLA benchmarks.
-
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.
-
SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving
SafeAlign-VLA uses counterfactual safety pairing and anchor-based group relative policy optimization to incorporate negative data for safer VLA-based autonomous driving.
-
EponaV2: Driving World Model with Comprehensive Future Reasoning
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
-
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.
-
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
-
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
A survey organizes synthetic data use, digital twin simulation, and domain adaptation techniques for autonomous driving while identifying open challenges like Sim2Real transfer.
-
Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance
The paper introduces a safety framework for datasets in autonomous driving that uses the AI Data Flywheel and lifecycle processes to identify hazards and ensure compliance with ISO/PAS 8800.
-
Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
Introduces the structured NuScenes-S dataset and the 0.9B FastDrive VLM, claiming 20% higher decision accuracy and an inference speedup of over 10x versus larger unstructured VLMs.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.