VECTOR-DRIVE couples vision-language reasoning and trajectory planning in a single Transformer via semantic expert routing and flow matching, reaching a driving score of 88.91 on Bench2Drive.
arXiv preprint arXiv:2510.12796 (2025)
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
Citation verdicts: 15 in 2026, all unverdicted.
Representative citing papers
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in a single Transformer via semantic expert routing and flow matching, reaching a driving score of 88.91 on Bench2Drive.
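To make the mechanism concrete, below is a minimal flow-matching trajectory sampler in the spirit of the summary: a learned velocity field, conditioned on routed expert features, is integrated from noise to waypoints. This is an illustrative sketch under assumed names (`velocity_field`, `cond`), not the paper's implementation.

```python
# Minimal flow-matching trajectory sampler (illustrative sketch, not the
# paper's code). Assumes a learned velocity field v(x, t, cond) trained to
# match the straight-line path from Gaussian noise to expert trajectories:
# x_t = (1 - t) * noise + t * traj, with target velocity traj - noise.
import numpy as np

def sample_trajectory(velocity_field, cond, horizon=8, dims=2, steps=10, seed=0):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (trajectory)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dims))      # start from pure noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, cond)   # explicit Euler step
    return x                                      # (horizon, dims) ego-frame waypoints

# Stand-in velocity field: pulls samples toward a straight 1 m/step path.
toy_target = np.stack([np.arange(1, 9, dtype=float), np.zeros(8)], axis=-1)
v_toy = lambda x, t, cond: toy_target - x         # a real model would be a network
print(sample_trajectory(v_toy, cond=None).round(2))
```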
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on camera-only NAVSIM and 94.8 with best-of-6 sampling.
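For intuition, here is a sketch of masked-discrete-diffusion decoding with a self-editing pass: tokens are unmasked in order of confidence, and already-decoded tokens the model later doubts are re-masked for refinement. The model interface, mask sentinel, and thresholds are hypothetical.

```python
# Sketch of masked-discrete-diffusion decoding with a self-editing pass
# (illustrative; the model, vocabulary, and thresholds are hypothetical).
import numpy as np

MASK = -1  # sentinel id for masked trajectory tokens

def decode(model, seq_len, vocab, rounds=4, edit_thresh=0.6):
    tokens = np.full(seq_len, MASK)
    for _ in range(rounds):
        probs = model(tokens)                      # (seq_len, vocab) per-token dist
        conf = probs.max(axis=-1)
        pred = probs.argmax(axis=-1)
        # Unmask the most confident still-masked positions this round.
        masked = tokens == MASK
        k = max(1, masked.sum() // 2)
        order = np.argsort(-conf * masked)[:k]
        tokens[order] = pred[order]
        # Self-editing: re-mask already-decoded tokens the model now doubts,
        # so a later round (e.g. after RL-aligned refinement) can fix them.
        doubt = (~masked) & (conf < edit_thresh)
        tokens[doubt] = MASK
    tokens[tokens == MASK] = pred[tokens == MASK]  # finalize any leftovers
    return tokens

toy = lambda t: np.tile(np.eye(4)[1] * 0.9 + 0.025, (len(t), 1))  # dummy model
print(decode(toy, seq_len=6, vocab=4))
```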
-
Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Infrastructure-centric world models use the temporal depth of roadside sensors to complement the spatial breadth of vehicle sensors, improving traffic simulation and prediction.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching a SOTA PDMS of 93.7 on NAVSIM.
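A toy version of the uncertainty-as-reward idea: the world model's prediction error acts as an exploration bonus, but only behind a safety gate. The gating rule and weights below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of an uncertainty-driven intrinsic reward with a safety gate
# (illustrative; the weighting and gating rule are assumptions).
def shaped_reward(task_reward, pred_error, min_ttc, beta=0.1, ttc_safe=3.0):
    """Add an exploration bonus only when the current state is judged safe.

    task_reward: extrinsic reward from the driving objective (e.g. PDMS terms)
    pred_error:  RGB/depth world-model prediction error, used as uncertainty
    min_ttc:     minimum time-to-collision over nearby agents (seconds)
    """
    safe = min_ttc >= ttc_safe           # hard gate: no curiosity near collisions
    intrinsic = beta * pred_error if safe else 0.0
    return task_reward + intrinsic

print(shaped_reward(1.0, pred_error=0.8, min_ttc=5.0))  # 1.08, bonus applied
print(shaped_reward(1.0, pred_error=0.8, min_ttc=1.0))  # 1.0, gated off
```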
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
-
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation models, delivering over 20% gains and 46-92% lower errors on KITTI, nuScenes, and A
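The rate-robust two-frame setup can be sketched as a pose regressor that takes the inter-frame time gap as an explicit input; the embeddings and linear head below are stand-ins, not OpenVO's architecture.

```python
# Sketch of two-frame ego-motion regression conditioned on the frame gap,
# so a single model can handle varying observation rates (illustrative).
import numpy as np

def predict_relative_pose(feat_a, feat_b, dt, head):
    """Regress a 6-DoF relative pose (tx, ty, tz, roll, pitch, yaw) from two
    frame embeddings plus the elapsed time dt between them."""
    x = np.concatenate([feat_a, feat_b, [dt]])   # dt encodes temporal dynamics
    return head @ x                              # linear stand-in for an MLP head

rng = np.random.default_rng(0)
fa, fb = rng.standard_normal(16), rng.standard_normal(16)
head = rng.standard_normal((6, 33))              # 16 + 16 + 1 inputs
print(predict_relative_pose(fa, fb, dt=0.1, head=head).round(2))
```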
-
EponaV2: Driving World Model with Comprehensive Future Reasoning
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
-
Driving Intents Amplify Planning-Oriented Reinforcement Learning
DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
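GRPO's core trick, group-relative advantages without a learned critic, fits in a few lines; DIAL's multi-intent grouping is paraphrased here, not reproduced.

```python
# Minimal group-relative advantage computation in the GRPO style
# (illustrative; DIAL's multi-intent grouping is an assumption here).
import numpy as np

def group_advantages(rewards):
    """rewards: (groups, samples) preference scores for N rollouts per intent.

    Each group shares one prompt/intent; advantages are rewards standardized
    within the group, pushing the policy toward its better-than-average
    rollouts without a critic network.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

# Two intents (e.g. "yield" vs "overtake"), four sampled trajectories each.
scores = np.array([[0.2, 0.5, 0.9, 0.4],
                   [0.7, 0.1, 0.3, 0.3]])
print(group_advantages(scores).round(2))
```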
-
CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop performance on Bench2Drive across multiple driving architectures.
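The two-term gradient estimate the summary describes might look like the following decomposition: a group-normalized counterfactual proxy score plus a residual correction applied only where closed-loop interaction events occur. All details below are assumptions, not CRAFT's actual estimator.

```python
# Sketch of a two-term advantage: group-normalized counterfactual proxy
# plus a residual correction from interaction events (assumed decomposition).
import numpy as np

def craft_style_advantage(proxy_scores, interaction_penalties):
    """proxy_scores: (N,) counterfactual scores ("replay the log, swap the ego
    plan") for N sampled plans; interaction_penalties: (N,) extra cost observed
    only when a plan actually triggers a closed-loop interaction event."""
    proxy = (proxy_scores - proxy_scores.mean()) / (proxy_scores.std() + 1e-8)
    residual = -interaction_penalties        # correction where the proxy is blind
    return proxy + residual

adv = craft_style_advantage(np.array([0.9, 0.6, 0.3]), np.array([0.0, 0.5, 0.0]))
print(adv.round(2))  # the middle plan is downweighted by its interaction event
```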
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow matching reformulated as an SDE with group-segmented policy optimization, surpassing prior VLA models on the LIBERO benchmarks.
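Reformulating a flow-matching sampler as an SDE amounts to adding a diffusion term to the learned drift, which makes rollouts stochastic and therefore scoreable by policy-gradient RL. A minimal Euler-Maruyama sketch with a stand-in velocity field; the noise scale and field are assumptions, not the paper's SDE.

```python
# Sketch of turning a deterministic flow-matching sampler into an SDE
# via Euler-Maruyama (illustrative; not OmniVLA-RL's exact formulation).
import numpy as np

def sde_sample(velocity_field, x0, steps=20, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        drift = velocity_field(x, t)
        noise = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift * dt + noise        # Euler-Maruyama step
    return x

target = np.array([1.0, 0.5])
v = lambda x, t: target - x               # stand-in for the learned field
print(sde_sample(v, np.zeros(2)).round(2))
```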