VECTOR-DRIVE couples vision-language reasoning and trajectory planning in a single Transformer via semantic expert routing and flow matching, reaching a driving score of 88.91 on Bench2Drive.
arXiv preprint arXiv:2510.12796 (2025)
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
Citation verdicts: 15 in 2026, all unverdicted.
Representative citing papers
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in a single Transformer via semantic expert routing and flow matching, reaching a driving score of 88.91 on Bench2Drive.
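To make the mechanism concrete, below is a minimal flow-matching trajectory sampler in the spirit of the summary: a learned velocity field, conditioned on routed expert features, is integrated from noise to waypoints. This is an illustrative sketch under assumed names (`velocity_field`, `cond`), not the paper's implementation.

```python
# Minimal flow-matching trajectory sampler (illustrative sketch, not the
# paper's code). Assumes a learned velocity field v(x, t, cond) trained to
# match the straight-line path from Gaussian noise to expert trajectories:
# x_t = (1 - t) * noise + t * traj, with target velocity traj - noise.
import numpy as np

def sample_trajectory(velocity_field, cond, horizon=8, dims=2, steps=10, seed=0):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (trajectory)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dims))      # start from pure noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, cond)   # explicit Euler step
    return x                                      # (horizon, dims) ego-frame waypoints

# Stand-in velocity field: pulls samples toward a straight 1 m/step path.
toy_target = np.stack([np.arange(1, 9, dtype=float), np.zeros(8)], axis=-1)
v_toy = lambda x, t, cond: toy_target - x         # a real model would be a network
print(sample_trajectory(v_toy, cond=None).round(2))
```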
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on camera-only NAVSIM and 94.8 with best-of-6 sampling.
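For intuition, here is a sketch of masked-discrete-diffusion decoding with a self-editing pass: tokens are unmasked in order of confidence, and already-decoded tokens the model later doubts are re-masked for refinement. The model interface, mask sentinel, and thresholds are hypothetical.

```python
# Sketch of masked-discrete-diffusion decoding with a self-editing pass
# (illustrative; the model, vocabulary, and thresholds are hypothetical).
import numpy as np

MASK = -1  # sentinel id for masked trajectory tokens

def decode(model, seq_len, vocab, rounds=4, edit_thresh=0.6):
    tokens = np.full(seq_len, MASK)
    for _ in range(rounds):
        probs = model(tokens)                      # (seq_len, vocab) per-token dist
        conf = probs.max(axis=-1)
        pred = probs.argmax(axis=-1)
        # Unmask the most confident still-masked positions this round.
        masked = tokens == MASK
        k = max(1, masked.sum() // 2)
        order = np.argsort(-conf * masked)[:k]
        tokens[order] = pred[order]
        # Self-editing: re-mask already-decoded tokens the model now doubts,
        # so a later round (e.g. after RL-aligned refinement) can fix them.
        doubt = (~masked) & (conf < edit_thresh)
        tokens[doubt] = MASK
    tokens[tokens == MASK] = pred[tokens == MASK]  # finalize any leftovers
    return tokens

toy = lambda t: np.tile(np.eye(4)[1] * 0.9 + 0.025, (len(t), 1))  # dummy model
print(decode(toy, seq_len=6, vocab=4))
```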
-
Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Infrastructure-centric world models use the temporal depth of roadside sensors to complement the spatial breadth of vehicle sensors, improving traffic simulation and prediction.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching a SOTA PDMS of 93.7 on NAVSIM.
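A toy version of the uncertainty-as-reward idea: the world model's prediction error acts as an exploration bonus, but only behind a safety gate. The gating rule and weights below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of an uncertainty-driven intrinsic reward with a safety gate
# (illustrative; the weighting and gating rule are assumptions).
def shaped_reward(task_reward, pred_error, min_ttc, beta=0.1, ttc_safe=3.0):
    """Add an exploration bonus only when the current state is judged safe.

    task_reward: extrinsic reward from the driving objective (e.g. PDMS terms)
    pred_error:  RGB/depth world-model prediction error, used as uncertainty
    min_ttc:     minimum time-to-collision over nearby agents (seconds)
    """
    safe = min_ttc >= ttc_safe           # hard gate: no curiosity near collisions
    intrinsic = beta * pred_error if safe else 0.0
    return task_reward + intrinsic

print(shaped_reward(1.0, pred_error=0.8, min_ttc=5.0))  # 1.08, bonus applied
print(shaped_reward(1.0, pred_error=0.8, min_ttc=1.0))  # 1.0, gated off
```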
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
-
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation models, delivering over 20% gains and 46-92% lower errors on KITTI, nuScenes, and A
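The rate-robust two-frame setup can be sketched as a pose regressor that takes the inter-frame time gap as an explicit input; the embeddings and linear head below are stand-ins, not OpenVO's architecture.

```python
# Sketch of two-frame ego-motion regression conditioned on the frame gap,
# so a single model can handle varying observation rates (illustrative).
import numpy as np

def predict_relative_pose(feat_a, feat_b, dt, head):
    """Regress a 6-DoF relative pose (tx, ty, tz, roll, pitch, yaw) from two
    frame embeddings plus the elapsed time dt between them."""
    x = np.concatenate([feat_a, feat_b, [dt]])   # dt encodes temporal dynamics
    return head @ x                              # linear stand-in for an MLP head

rng = np.random.default_rng(0)
fa, fb = rng.standard_normal(16), rng.standard_normal(16)
head = rng.standard_normal((6, 33))              # 16 + 16 + 1 inputs
print(predict_relative_pose(fa, fb, dt=0.1, head=head).round(2))
```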
-
EponaV2: Driving World Model with Comprehensive Future Reasoning
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
-
Driving Intents Amplify Planning-Oriented Reinforcement Learning
DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
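GRPO's core trick, group-relative advantages without a learned critic, fits in a few lines; DIAL's multi-intent grouping is paraphrased here, not reproduced.

```python
# Minimal group-relative advantage computation in the GRPO style
# (illustrative; DIAL's multi-intent grouping is an assumption here).
import numpy as np

def group_advantages(rewards):
    """rewards: (groups, samples) preference scores for N rollouts per intent.

    Each group shares one prompt/intent; advantages are rewards standardized
    within the group, pushing the policy toward its better-than-average
    rollouts without a critic network.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

# Two intents (e.g. "yield" vs "overtake"), four sampled trajectories each.
scores = np.array([[0.2, 0.5, 0.9, 0.4],
                   [0.7, 0.1, 0.3, 0.3]])
print(group_advantages(scores).round(2))
```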
-
CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop performance on Bench2Drive across multiple driving architectures.
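The two-term gradient estimate the summary describes might look like the following decomposition: a group-normalized counterfactual proxy score plus a residual correction applied only where closed-loop interaction events occur. All details below are assumptions, not CRAFT's actual estimator.

```python
# Sketch of a two-term advantage: group-normalized counterfactual proxy
# plus a residual correction from interaction events (assumed decomposition).
import numpy as np

def craft_style_advantage(proxy_scores, interaction_penalties):
    """proxy_scores: (N,) counterfactual scores ("replay the log, swap the ego
    plan") for N sampled plans; interaction_penalties: (N,) extra cost observed
    only when a plan actually triggers a closed-loop interaction event."""
    proxy = (proxy_scores - proxy_scores.mean()) / (proxy_scores.std() + 1e-8)
    residual = -interaction_penalties        # correction where the proxy is blind
    return proxy + residual

adv = craft_style_advantage(np.array([0.9, 0.6, 0.3]), np.array([0.0, 0.5, 0.0]))
print(adv.round(2))  # the middle plan is downweighted by its interaction event
```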
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow matching reformulated as an SDE with group-segmented policy optimization, surpassing prior VLA models on the LIBERO benchmarks.
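Reformulating a flow-matching sampler as an SDE amounts to adding a diffusion term to the learned drift, which makes rollouts stochastic and therefore scoreable by policy-gradient RL. A minimal Euler-Maruyama sketch with a stand-in velocity field; the noise scale and field are assumptions, not the paper's SDE.

```python
# Sketch of turning a deterministic flow-matching sampler into an SDE
# via Euler-Maruyama (illustrative; not OmniVLA-RL's exact formulation).
import numpy as np

def sde_sample(velocity_field, x0, steps=20, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        drift = velocity_field(x, t)
        noise = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift * dt + noise        # Euler-Maruyama step
    return x

target = np.array([1.0, 0.5])
v = lambda x, t: target - x               # stand-in for the learned field
print(sde_sample(v, np.zeros(2)).round(2))
```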