arXiv preprint arXiv:2505.16278 (2025)

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan · 2025 · cs.CV · arXiv 2505.16278

Canonical reference. 89% of citing Pith papers cite this work as background.

20 Pith papers citing it

Background: 89% of classified citations

abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
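
The abstract describes two learned routers: one that selects relevant cameras and one that activates skill-specific experts. It does not state the routing rule, so the following is only a minimal sketch of a generic top-k softmax gate, not DriveMoE's actual implementation. The function names, camera names, and scores are all hypothetical and chosen purely for illustration.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring experts and renormalize their weights.

    `logits` maps a hypothetical expert name (a camera view or a driving
    skill) to a router score. Returns (name, weight) pairs, highest weight
    first, whose weights sum to 1.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    names = list(logits)
    probs = softmax([logits[n] for n in names])
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)[:k]
    kept_mass = sum(p for _, p in ranked)
    return [(name, p / kept_mass) for name, p in ranked]


# Hypothetical router scores for one driving frame (illustrative values only).
camera_logits = {"front": 2.0, "front_left": 0.5, "back": -1.0, "front_right": 0.1}
selected = route_top_k(camera_logits, k=2)
print([name for name, _ in selected])
```

With these made-up scores the gate keeps the two highest-scoring views and discards the rest. The same function could stand in for a skill router by passing skill names instead of camera names; in a trained model the logits would come from a learned network rather than a hand-written dictionary.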

citation-role summary

background: 8 · baseline: 1

citation-polarity summary

background: 8 · baseline: 1

representative citing papers

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

Bench2Drive-Robust is a new closed-loop benchmark that evaluates end-to-end autonomous driving models under deployment perturbations from camera failures, ego-state errors, and compute delays, showing substantial performance degradation beyond image-level tests.

Learning Vision-Language-Action World Models for Autonomous Driving

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.

LACO: Adaptive Latent Communication for Collaborative Driving

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

cs.CV · 2026-05-18 · conditional · novelty 6.0

UniTrans pretrains a bank of translator experts and learns combination coefficients from modality mappings in a scene-invariant latent space to enable zero-shot any-to-any feature translation for heterogeneous collaborative perception.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

cs.CV · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and multi-view geometry.

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.

CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

cs.CV · 2026-03-19 · unverdicted · novelty 6.0

CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

cs.CV · 2025-12-11 · conditional · novelty 6.0

SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.

Continually Evolving Skill Knowledge in Vision Language Action Model

cs.RO · 2025-11-22 · unverdicted · novelty 6.0

Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results with only 1% data replay and successful real-world transfer on dual-arm hardware.

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

cs.CV · 2025-10-14 · unverdicted · novelty 6.0

DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

ReSim: Reliable World Simulation for Autonomous Driving

cs.CV · 2025-06-11 · unverdicted · novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.

EponaV2: Driving World Model with Comprehensive Future Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

cs.LG · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

SceneSelect discovers latent scene categories via clustering, trains a classifier to assign inputs, and dispatches to expert trajectory predictors, reporting 10.5% average gains over single-model and ensemble baselines on ETH-UCY, SDD, and NBA.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

cs.RO · 2026-05-12

citing papers explorer

Showing 20 of 20 citing papers.

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations cs.RO · 2026-05-18 · unverdicted · none · ref 50 · internal anchor
Learning Vision-Language-Action World Models for Autonomous Driving cs.CV · 2026-04-10 · unverdicted · none · ref 72 · internal anchor
ChainFlow-VLA: Causal Flow Planning with Vision-Language Models cs.CV · 2026-05-22 · unverdicted · none · ref 19 · internal anchor
LACO: Adaptive Latent Communication for Collaborative Driving cs.AI · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception cs.CV · 2026-05-18 · conditional · none · ref 42 · internal anchor
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV · 2026-05-11 · unverdicted · none · ref 53 · 2 links · internal anchor
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving cs.CV · 2026-05-09 · unverdicted · none · ref 26 · 2 links · internal anchor
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving cs.CV · 2026-04-21 · unverdicted · none · ref 30 · internal anchor
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving cs.CV · 2026-04-09 · unverdicted · none · ref 61 · internal anchor
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale cs.CV · 2026-04-01 · unverdicted · none · ref 81 · internal anchor
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention cs.CV · 2026-03-19 · unverdicted · none · ref 38 · internal anchor
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 128 · internal anchor
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving cs.CV · 2025-12-11 · conditional · none · ref 69 · internal anchor
Continually Evolving Skill Knowledge in Vision Language Action Model cs.RO · 2025-11-22 · unverdicted · none · ref 46 · internal anchor
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving cs.CV · 2025-10-14 · unverdicted · none · ref 30 · internal anchor
ReSim: Reliable World Simulation for Autonomous Driving cs.CV · 2025-06-11 · unverdicted · none · ref 97 · internal anchor
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model cs.CV · 2026-05-21 · unverdicted · none · ref 56 · internal anchor
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 82 · internal anchor
SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling cs.LG · 2026-04-27 · unverdicted · none · ref 37 · 2 links · internal anchor
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization cs.RO · 2026-05-12 · unreviewed · ref 90 · internal anchor
