{"total":22,"items":[{"citing_arxiv_id":"2605.23270","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChainFlow-VLA: Causal Flow Planning with Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T06:17:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22089","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:31:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21061","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounding Driving VLA via Inverse Kinematics","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:45:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and 
nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17284","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-17T06:45:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15120","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning","primary_cat":"cs.RO","submitted_at":"2026-05-14T17:32:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14696","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EponaV2: Driving World Model with Comprehensive Future Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T11:12:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving 
SOTA planning performance on NAVSIM benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DrivingGPT [9] employs VLMs to unify simulation and planning tasks, while RoboTron-Sim [3] enhances safety using scenario-aware prompts and an image-to-ego encoder. To facilitate planning without explicit perception, World4Drive [ 96] constructs a latent world model, and Epona [ 92] utilizes a diffusion world model for autoregressive video generation. Additionally, DriveVLA- W0 [36], DriveLaW [77] and PWM [95] implement dense self-supervision for planning by training on future image predictions. While perception-free models offer advantages in data scaling, they often struggle to interpret complex environments with the limited self-supervision, which can lead to suboptimal performance. 3 Methods 3.1 Applying World Model to Driving with Flow Matching"},{"citing_arxiv_id":"2605.12625","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Driving Intents Amplify Planning-Oriented Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12624","ref_index":28,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 is the first unified 
streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10426","ref_index":65,"ref_count":4,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:01:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"PWM [81] NeurIPS'25 CN98.6 95.9 95.410081.8 88.1 WoTE [43] ICCV'25 C & LN98.5 96.8 94.9 99.9 81.9 88.3 ResWorld [82] ICLR'26 C & LN98.9 96.5 95.610083.1 89.0 WorldDrive [83] arXiv'26 CN98.4 96.8 95.210083.3 89.0 DriveLaW [19] CVPR'26 CN99.0 97.196.7 10081.3 89.1 Vision-Language Model Methods ReCogDrive† [50] ICLR'26 C 1 98.1 94.7 94.210080.9 86.5 DriveVLA-W0 [65] ICLR'26 CN98.4 95.3 95.410080.9 87.2 LaST-VLA† [54] arXiv'26 C 1 98.7 95.4 95.710080.5 87.3 SGDrive† [84] CVPR'26 CN98.6 95.1 95.410081.2 87.4 Uni-World VLA [66] arXiv'26 CN98.7 96.7 96.110083.2 89.4 CoWorld-VLA¶ - C 1 98.5 96.9 95.410083.2 89.1 CoWorld-VLA (ours) - C 199.296.8 96.6 100 83.6 89.8 Implementation details.CoWorld-VLA consists of a video diffusion Transformer (Wan2."},{"citing_arxiv_id":"2605.09701","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DriveFuture: Future-Aware Latent World Models for Autonomous 
Driving","primary_cat":"cs.CV","submitted_at":"2026-05-10T18:45:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compact BEV or structured state spaces [ 37, 38], which support future evaluation and trajectory selection through a unified BEV latent space or a BEV world model. More recently, world models have also been increasingly integrated with planner- or VLA-oriented frameworks, as exemplified by LAW [17], World4Drive [18], WorldRFT [19], DriveWorld-VLA [20], DriveVLA-W0 [21], and DriveLaW [22], suggesting a clear trend toward tighter coupling between world modeling, trajectory planning, and scalable autonomous driving systems. Although these methods have demonstrated strong potential for future scene modeling and planning support, they typically involve high modeling complexity and are prone to error accumulation in observation-space or dense-space reconstruction."},{"citing_arxiv_id":"2605.08830","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-09T09:34:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"π0 introduces a flow-matching action model on top of a pretrained 
vision-language backbone [24]. This paradigm has also been explored in autonomous driving. DriveVLA-W0 introduces a lightweight action expert, while DriveMoE extends the π0-style formulation with mixture-of-experts modules for scene-specialized perception and skill-specialized action generation [25], [26]. These designs reduce the burden on general-purpose VLM representations and strengthen action modeling. Diffusion- and flow-based planners further improve continuous trajectory generation by modeling complex future distributions beyond deterministic regression [4], [24], [29]. However, fully separated reasoning-action pipelines often reduce the"},{"citing_arxiv_id":"2605.04647","ref_index":111,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving","primary_cat":"cs.RO","submitted_at":"2026-05-06T08:52:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04470","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies","primary_cat":"cs.LG","submitted_at":"2026-05-06T03:49:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop performance on Bench2Drive across multiple driving 
architectures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chen, and Y . Wang. Drivefine: Refining-augmented masked diffusion vla for precise and robust driving.arXiv preprint arXiv:2602.14577, 2026. [6] T. Xia, Y . Li, L. Zhou, J. Yao, K. Xiong, H. Sun, B. Wang, K. Ma, G. Chen, H. Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world.arXiv preprint arXiv:2512.23421, 2025. [7] Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, Y . An, C. Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025. [8] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models."},{"citing_arxiv_id":"2604.19710","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:34:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"(Predictive Driver Model Score), NC (No Collision), DAC (Drivable Area Compliance), EP (Ego Process), TTC (Time-To-Collision), Comf. (Comfort), Methods Cam. Lid. 
PDMS↑ NC↑DAC↑EP↑TTC↑Comf.↑ Conventional End-to-end-based Methods TransFuser [6] ✓ ✓ 84.0 97.8 92.6 78.9 92.9100.0 DRAMA [76] ✓ ✓ 86.9 98.2 95.2 81.3 94.2100.0 Hydra-MDP [41] ✓ ✓ 86.5 98.3 96.0 78.7 94.6100.0 DiffusionDrive [42] ✓ ✓ 88.1 98.2 96.2 82.2 94.7100.0 WoTE [39] ✓ ✓ 88.3 98.5 96.8 81.9 94.4 99.9 VLA-based Methods ReCogDrive [38] ✓- 89.6 98.2 97.8 83.5 95.2 99.8 DriveVLA-W0 [38] ✓- 90.2 98.799.183.3 95.3 99.3 AutoVLA [38] ✓- 89.1 98.4 95.6 81.998.099.9 Ours SpanVLA (One-shot) ✓- 82.1 97.5 90.8 76.9 93.7 99.5 SpanVLA (Post-RFT) ✓- 90.3 99.197.186.395.2100.0 pseudo closed-loop simulation to evaluate the driving performance and robust-"},{"citing_arxiv_id":"2604.17706","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL","primary_cat":"cs.RO","submitted_at":"2026-04-20T01:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17651","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception","primary_cat":"cs.CV","submitted_at":"2026-04-19T22:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and 
prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07990","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:59:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tapvid-3d: A benchmark for tracking any point in 3d.Advances in Neural Information Processing Systems (NeurIPS), 37:82149-82165, 2024. 8 [30] Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving naviga- tion world models.arXiv preprint arXiv:2510.18313, 2025. 2 [31] Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World mod- els amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025. 
2 [32] Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng,"},{"citing_arxiv_id":"2604.02714","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-03T04:14:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"decision-making, fail to capture the rich spatial geometry and fine-grained appearance of driving scenes. This supervisory deficit constrains the model's ability to build comprehensive internal representations, particularly for aspects of the environment (such as road topology, object extent, and depth ordering) that are critical for safe planning but are not explicitly encoded in sparse action labels [27]. In this work, we propose a unified understanding-and-generation framework that addresses both limitations through a single mechanism: dense world modeling and exploration (Fig. 1). Specifically, we augment trajectory prediction with future RGB and depth image generation as auxiliary objectives. 
On the supervision side, these generation tasks require the model to predict fine-grained"},{"citing_arxiv_id":"2604.00813","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-01T12:21:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"DiffusionDrive [41]C & L Map & Box 98.2 96.2 94.7 100 82.2 88.1 WoTE [35] C & L Map & Box 98.5 96.8 94.9 99.9 81.9 88.3 DriveSuprim [82] C & L Map & Box 97.8 97.3 93.6 100 86.7 89.9 AutoVLA [90] C Language 96.9 92.4 88.1 99.9 75.8 80.5 AdaThinkDrive [53]C Language 98.5 94.4 94.9 100 79.9 86.2 ReCogDrive [37] C Language 98.3 95.1 94.3 100 81.1 86.8 DriveVLA-W0 [34] C Future States 98.7 99.1 95.3 99.3 83.3 90.2 AutoVLA† [90] C Language & RL 98.4 95.6 98.0 99.9 85.9 89.1 ReCogDrive† [37] C Language & RL 98.2 97.8 95.2 99.8 83.5 89.6 DVGT-2 C Dense Geometry 97.8 97.2 93.9 100 83.4 88.6 DVGT-2-NAVSIMC Dense Geometry 98.7 97.9 95.8 100 84.3 90.3 Table 6: Closed-loop planning results on NAVSIM v2navtestsplit. 
Method NC↑DAC↑DDC↑TL↑EP↑TTC↑LK↑HC↑EC↑ EPDMS↑"},{"citing_arxiv_id":"2602.19035","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness","primary_cat":"cs.CV","submitted_at":"2026-02-22T04:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation models, delivering over 20% gains and 46-92% lower errors on KITTI, nuScenes, and A","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23421","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DriveLaW:Unifying Planning and Video Generation in a Latent Driving World","primary_cat":"cs.CV","submitted_at":"2025-12-29T12:32:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.23369","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SimScale: Learning to Drive via Real-World Simulation at Scale","primary_cat":"cs.CV","submitted_at":"2025-11-28T17:17:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimScale synthesizes unseen driving states from real logs via neural rendering and reactive environments, generates pseudo-expert 
trajectories, and shows that co-training on real plus simulated data improves planning robustness and generalization on real benchmarks, with gains scaling by simulation ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}