hub Canonical reference

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen · 2025 · cs.RO · arXiv 2511.00088

Canonical reference. 86% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 45 citing papers arXiv PDF

abstract

End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 6 unclear 1

representative citing papers

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

VLADriveBench combines observational metrics and CoT intervention protocols to evaluate the relevance and causality of reasoning in vision-language-action models for autonomous driving, revealing divergent model behaviors.

M*: A Modular, Extensible, Serving System for Multimodal Models

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and robotic planning workloads.

Foresight: Iterative Reasoning About Clues that Matter for Navigation

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

Foresight uses iterative VLM plan proposal and critique with RL from human feedback to raise navigation success 37% and cut interventions 52% in real-world tests.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

Grounding Driving VLA via Inverse Kinematics

cs.CV · 2026-05-20 · conditional · novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.

Latent Chain-of-Thought World Modeling for End-to-End Driving

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and better trajectories than text-based or non-reasoning baselines.

ROSA: A Robotics Foundation Model Serving System for Robot Factories

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.

Scaling Self-Play for End-to-End Driving

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

Self-play DAgger training in a batched pixel renderer produces end-to-end driving policies that reach competitive performance on HUGSIM and NAVSIM-v2 after real-world adaptation and improve with more self-play compute.

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

PersonaDrive retrieves style-specific human driving demonstrations to condition a single VLA backbone for diverse closed-loop driving agents, reporting 4.6% and 2.5% driving score gains over baselines on Bench2Drive with style consistency within 2%.

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

VLGA introduces geometry as a fourth modality in VLA models via pointmap regression loss, reporting SOTA open-loop and closed-loop driving metrics on nuScenes and Bench2Drive.

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

A structured perturbation framework applied to VLA driving models reveals evaluation-dependent visual grounding patterns and uneven dependency across abstraction levels.

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

ReasonBreak demonstrates up to 89% attack success on reasoning and 72% on trajectories in NVIDIA Alpamayo VLA models via black-box textual perturbations, introducing a reasoning-aware evaluation framework and benchmark for autonomous driving.

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.

LACO: Adaptive Latent Communication for Collaborative Driving

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.

CosFly: Plan in the Matrix, Fly in the World

cs.RO · 2026-05-18 · unverdicted · novelty 6.0

CosFly introduces a box-structured planning and multimodal simulation pipeline for aerial target tracking in CARLA, paired with the public CosFly-Track dataset containing 250 trajectories and approximately 100,000 rendered multi-modal images.

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

cs.AI · 2026-05-17 · unverdicted · novelty 6.0

VLA driving models show 42.5% reasoning fidelity and 48.3% reasoning-action consistency, with 97.7% trajectory fragility under perturbations.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.

citing papers explorer

Showing 15 of 15 citing papers after filters.

Foresight: Iterative Reasoning About Clues that Matter for Navigation cs.RO · 2026-06-10 · unverdicted · none · ref 7 · internal anchor
Foresight uses iterative VLM plan proposal and critique with RL from human feedback to raise navigation success 37% and cut interventions 52% in real-world tests.
Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning cs.RO · 2026-04-23 · unverdicted · none · ref 31 · internal anchor
A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.
ROSA: A Robotics Foundation Model Serving System for Robot Factories cs.RO · 2026-07-01 · unverdicted · none · ref 44 · internal anchor
ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.
Scaling Self-Play for End-to-End Driving cs.RO · 2026-06-17 · unverdicted · none · ref 24 · internal anchor
Self-play DAgger training in a batched pixel renderer produces end-to-end driving policies that reach competitive performance on HUGSIM and NAVSIM-v2 after real-world adaptation and improve with more self-play compute.
CosFly: Plan in the Matrix, Fly in the World cs.RO · 2026-05-18 · unverdicted · none · ref 58 · internal anchor
CosFly introduces a box-structured planning and multimodal simulation pipeline for aerial target tracking in CARLA, paired with the public CosFly-Track dataset containing 250 trajectories and approximately 100,000 rendered multi-modal images.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 29 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving cs.RO · 2026-04-12 · unverdicted · none · ref 60 · internal anchor
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles cs.RO · 2026-04-09 · unverdicted · none · ref 42 · internal anchor
LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.
Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving cs.RO · 2026-04-03 · unverdicted · none · ref 12 · internal anchor
Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL training data.
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving cs.RO · 2026-02-26 · unverdicted · none · ref 51 · internal anchor
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
DriveSafer: End-to-End Autonomous Driving with Safety Guidance cs.RO · 2026-05-16 · unverdicted · none · ref 33 · internal anchor
DriveSafer reduces catastrophic failures (PDMS=0) by 48% and drivable-area compliance failures by over 65% versus DiffusionDrive on the NAVSIM benchmark by combining training-time safety constraints with inference-time guidance.
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling cs.RO · 2026-05-13 · unverdicted · none · ref 51 · 2 links · internal anchor
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and PDMS 91.1 on Bench2Drive and NAVSIM.
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
Position: Good Embodied Reward Models Need Bad Behavior Data cs.RO · 2026-05-31 · unverdicted · none · ref 28 · internal anchor
Embodied reward models systematically over-reward unsafe, suboptimal, and shortcut robot behaviors due to training on successful data only, and modest inclusion of bad behavior data improves alignment with human preferences.
PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs cs.RO · 2026-05-26 · unverdicted · none · ref 23 · internal anchor
PEACE decouples single-pass LLM planning from PX4 execution via ROS 2 and a constraint layer, with modular 3D perception, and shows feasibility in Gazebo SITL with improved explainability and fewer LLM calls.

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer