TPS-Drive uses an agent-centric tokenizer supervised by a frozen 3D detection head to purify VLM spatial representations, enabling better scene forecasting and lower collision rates on nuScenes and NAVSIM benchmarks.
hub Mixed citations
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Mixed citation behavior. Most common role is background (60%).
abstract
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.
Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
A structured perturbation framework applied to VLA driving models reveals evaluation-dependent visual grounding patterns and uneven dependency across abstraction levels.
AnyScene is an occupancy-centric framework using a Spatial-Temporal Occupancy Diffusion Transformer and Geometry-Grounded View Expansion to generate controllable driving scenes and videos from BEV layouts.
LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.
CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.
CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
GSDrive combines IL priors with RL feedback by probing multi-mode futures inside a 3D Gaussian Splatting simulator to supply dense rewards for closed-loop driving policy improvement on nuScenes.
Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
Pseudo-expert regularized offline RL reduces collisions and improves route completion for camera-based driving models trained on fixed simulator datasets from nuScenes.
citing papers explorer
-
LACO: Adaptive Latent Communication for Collaborative Driving
LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.