{"total":17,"items":[{"citing_arxiv_id":"2606.28757","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models","primary_cat":"cs.CV","submitted_at":"2026-06-27T06:13:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CrashTwin is a new benchmark framework that exposes physical violations in state-of-the-art world models during multi-agent collisions despite high visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02482","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-01T16:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23163","ref_index":20,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving","primary_cat":"cs.CL","submitted_at":"2026-05-22T02:31:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time 
averaging.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22446","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts","primary_cat":"cs.CV","submitted_at":"2026-05-21T13:13:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16737","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DriveSafer: End-to-End Autonomous Driving with Safety Guidance","primary_cat":"cs.RO","submitted_at":"2026-05-16T01:21:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DriveSafer reduces catastrophic failures (PDMS=0) by 48% and drivable-area compliance failures by over 65% versus DiffusionDrive on the NAVSIM benchmark by combining training-time safety constraints with inference-time guidance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12625","ref_index":23,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Driving Intents Amplify Planning-Oriented Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference 
scores above human demonstrations for the first time on WOD-E2E.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12624","ref_index":47,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12622","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action Emergence from Streaming Intent","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-controllable plans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10388","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-11T11:34:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy 
at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest frequencies across Waymo, nuScenes, and PAVE datasets.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Is the highest available frequency always the best, or does the best frequency depend on the dataset and the model? We study this question by measuring the frequency-response of different E2E trajectory prediction models, defined as the change in trajectory-prediction performance across temporal sampling frequencies. Starting from three E2E datasets, namely Waymo [28], nuScenes [4], and PAVE [17], we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory timeline. Each training set corresponds to one temporal sampling frequency. Higher-frequency training sets retain more camera frames and generate more training samples, whereas lower-frequency training sets retain fewer frames and generate fewer samples."},{"citing_arxiv_id":"2604.19710","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:34:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sion, reasoning, and action space is the core question of the VLA model. Although directly generating action tokens within VLM [34,81] simplifies the model structure and unifies the reasoning and planning, it requires autoregressive decoding with the large 
model, leading to high latency, especially for high-frequency control in autonomous driving. Thus, some methods [29,43,52,70,73] decouple the VLM from the end-to-end driving pipeline, using the VLM to provide supervision or high-level guidance while delegating low-level planning to a separate end-to-end module. However, such designs break the end-to-end optimization paradigm, increasing system complexity and training difficulty. 2) Only learning from positive samples with limited robustness."},{"citing_arxiv_id":"2604.08008","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-09T09:10:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SearchAD is a large-scale semantic image retrieval benchmark for rare driving scenarios that supports text-to-image and image-to-image tasks and shows text-based methods outperform image-based ones while overall performance stays limited.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03497","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-04-03T22:41:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL training 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00696","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRIV-EX: Counterfactual Explanations for Driving LLMs","primary_cat":"cs.CL","submitted_at":"2026-02-28T15:12:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD driving data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.17445","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents","primary_cat":"cs.CV","submitted_at":"2025-12-19T10:57:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03370","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding","primary_cat":"cs.CV","submitted_at":"2025-12-03T02:06:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ShelfGaussian achieves state-of-the-art zero-shot semantic 
occupancy prediction on Occ3D-nuScenes by jointly supervising Gaussian representations with vision foundation model features at 2D image and 3D scene levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.23369","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimScale: Learning to Drive via Real-World Simulation at Scale","primary_cat":"cs.CV","submitted_at":"2025-11-28T17:17:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimScale synthesizes unseen driving states from real logs via neural rendering and reactive environments, generates pseudo-expert trajectories, and shows that co-training on real plus simulated data improves planning robustness and generalization on real benchmarks, with gains scaling by simulation ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13757","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-06-16T17:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Policy Optimization (GRPO) [49] with verifiable planning reward functions. This enables adaptive reasoning that balances planning accuracy and efficiency. 
The RFT method not only improves planning performance but also runtime efficiency by minimizing unnecessary reasoning. We extensively evaluate AutoVLA using real-world datasets, including nuPlan [50, 51], Waymo [52], nuScenes [53], and simulation datasets such as CARLA [54, 55]. Experimental results demonstrate that AutoVLA achieves superior performance across various end-to-end autonomous driving benchmarks under both open-loop and closed-loop tests. Empirical results validate that our RFT approach"}],"limit":50,"offset":0}