hub Canonical reference

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

· 2025 · cs.RO · arXiv 2509.01944

Canonical reference. 78% of citing Pith papers cite this work as background.

12 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 12 citing papers arXiv PDF

abstract

Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamic, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both nuScenes and Waymo datasets demonstrates the state-of-the-art performance and robust generalization capacity of our proposed method.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 1

citation-polarity summary

background 7 unclear 1 use method 1

representative citing papers

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.

Learning Vision-Language-Action World Models for Autonomous Driving

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

cs.RO · 2025-10-30 · conditional · novelty 6.0

Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.

EponaV2: Driving World Model with Comprehensive Future Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

cs.RO · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and PDMS 91.1 on Bench2Drive and NAVSIM.

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

cs.AI · 2025-10-06 · unverdicted · novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

citing papers explorer

Showing 12 of 12 citing papers.

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval cs.CV · 2026-04-22 · unverdicted · none · ref 112 · internal anchor
ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.
Learning Vision-Language-Action World Models for Autonomous Driving cs.CV · 2026-04-10 · unverdicted · none · ref 73 · internal anchor
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models cs.CV · 2026-04-06 · unverdicted · none · ref 51 · internal anchor
Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 68 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 21 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval cs.CV · 2026-04-21 · unverdicted · none · ref 113 · internal anchor
Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail cs.RO · 2025-10-30 · conditional · none · ref 107 · internal anchor
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools cs.CV · 2026-05-20 · unverdicted · none · ref 99 · internal anchor
IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 86 · internal anchor
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling cs.RO · 2026-05-13 · unverdicted · none · ref 57 · 2 links · internal anchor
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and PDMS 91.1 on Bench2Drive and NAVSIM.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation cs.AI · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI cs.AI · 2025-10-06 · unverdicted · none · ref 22 · internal anchor
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer