Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen · 2025 · cs.RO · arXiv 2511.00088

Canonical reference. 86% of citing Pith papers cite this work as background.

38 Pith papers citing it

Background 86% of classified citations

abstract

End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.

citation-role summary

background: 7

citation-polarity summary

background: 6 · unclear: 1

representative citing papers

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark on which nine state-of-the-art video generation models score at most 52% on causal intervention tests, with open-source systems clustering near 28%.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

The MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

Grounding Driving VLA via Inverse Kinematics

cs.CV · 2026-05-20 · conditional · novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.

Latent Chain-of-Thought World Modeling for End-to-End Driving

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and better trajectories than text-based or non-reasoning baselines.

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

A structured perturbation framework applied to VLA driving models reveals evaluation-dependent visual grounding patterns and uneven dependency across abstraction levels.

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

ReasonBreak demonstrates up to 89% attack success on reasoning and 72% on trajectories in NVIDIA Alpamayo VLA models via black-box textual perturbations, introducing a reasoning-aware evaluation framework and benchmark for autonomous driving.

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.

LACO: Adaptive Latent Communication for Collaborative Driving

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.

CosFly: Plan in the Matrix, Fly in the World

cs.RO · 2026-05-18 · unverdicted · novelty 6.0

CosFly introduces a box-structured planning and multimodal simulation pipeline for aerial target tracking in CARLA, paired with the public CosFly-Track dataset containing 250 trajectories and approximately 100,000 rendered multi-modal images.

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

cs.AI · 2026-05-17 · unverdicted · novelty 6.0

VLA driving models show 42.5% reasoning fidelity and 48.3% reasoning-action consistency, with 97.7% trajectory fragility under perturbations.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

The paper creates the LTD dataset for open-ended traffic VQA and trains the UniVLT model, reporting SOTA results on unified microscopic AD and macroscopic traffic reasoning tasks.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

cs.RO · 2026-04-12 · unverdicted · novelty 6.0

The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.

Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

cs.RO · 2026-04-09 · unverdicted · novelty 6.0

An LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

cs.RO · 2026-04-03 · unverdicted · novelty 6.0

Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL training data.

DRIV-EX: Counterfactual Explanations for Driving LLMs

cs.CL · 2026-02-28 · unverdicted · novelty 6.0

DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD driving data.

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

cs.RO · 2026-02-26 · unverdicted · novelty 6.0

The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.

citing papers explorer

Showing 34 of 34 citing papers after filters.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving cs.CV · 2026-06-01 · unverdicted · none · ref 29 · internal anchor
GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.
What-If World: A Causal Benchmark for General World Models in Embodied Scenarios cs.CV · 2026-05-26 · unverdicted · none · ref 65 · internal anchor
What-If World is a new paired-prompt benchmark on which nine state-of-the-art video generation models score at most 52% on causal intervention tests, with open-source systems clustering near 28%.
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks cs.CV · 2026-05-21 · unverdicted · none · ref 23 · internal anchor
The MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning cs.RO · 2026-04-23 · unverdicted · none · ref 31 · internal anchor
A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and elevation consistency on the ORAD-3D benchmark.
Latent Chain-of-Thought World Modeling for End-to-End Driving cs.CV · 2025-12-11 · unverdicted · none · ref 24 · internal anchor
LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and better trajectories than text-based or non-reasoning baselines.
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement cs.CV · 2026-05-29 · unverdicted · none · ref 125 · internal anchor
StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.
nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving cs.CV · 2026-05-29 · unverdicted · none · ref 35 · internal anchor
nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.
Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior? cs.CV · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
A structured perturbation framework applied to VLA driving models reveals evaluation-dependent visual grounding patterns and uneven dependency across abstraction levels.
ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving cs.CR · 2026-05-27 · unverdicted · none · ref 44 · internal anchor
ReasonBreak demonstrates up to 89% attack success on reasoning and 72% on trajectories in NVIDIA Alpamayo VLA models via black-box textual perturbations, introducing a reasoning-aware evaluation framework and benchmark for autonomous driving.
DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving cs.CV · 2026-05-27 · unverdicted · none · ref 28 · internal anchor
DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.
LACO: Adaptive Latent Communication for Collaborative Driving cs.AI · 2026-05-21 · unverdicted · none · ref 34 · internal anchor
LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.
CosFly: Plan in the Matrix, Fly in the World cs.RO · 2026-05-18 · unverdicted · none · ref 58 · internal anchor
CosFly introduces a box-structured planning and multimodal simulation pipeline for aerial target tracking in CARLA, paired with the public CosFly-Track dataset containing 250 trajectories and approximately 100,000 rendered multi-modal images.
CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving cs.CV · 2026-05-17 · unverdicted · none · ref 39 · internal anchor
CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.
Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models cs.AI · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
VLA driving models show 42.5% reasoning fidelity and 48.3% reasoning-action consistency, with 97.7% trajectory fragility under perturbations.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 29 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset cs.CV · 2026-04-24 · unverdicted · none · ref 5 · internal anchor
The paper creates the LTD dataset for open-ended traffic VQA and trains the UniVLT model, reporting SOTA results on unified microscopic AD and macroscopic traffic reasoning tasks.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation cs.CV · 2026-04-20 · unverdicted · none · ref 104 · 2 links · internal anchor
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving cs.RO · 2026-04-12 · unverdicted · none · ref 60 · internal anchor
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models cs.CV · 2026-04-09 · unverdicted · none · ref 35 · internal anchor
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles cs.RO · 2026-04-09 · unverdicted · none · ref 42 · internal anchor
An LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.
Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving cs.RO · 2026-04-03 · unverdicted · none · ref 12 · internal anchor
Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL training data.
DRIV-EX: Counterfactual Explanations for Driving LLMs cs.CL · 2026-02-28 · unverdicted · none · ref 6 · internal anchor
DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD driving data.
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving cs.RO · 2026-02-26 · unverdicted · none · ref 51 · internal anchor
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
PriorEye: Geospatial Visual Priors for End-to-End Autonomous Driving cs.CV · 2026-06-30 · unverdicted · none · ref 45 · internal anchor
PriorEye augments end-to-end driving models with a dual-memory architecture that stores and gates geospatial visual priors to improve performance and robustness to sensor corruption on NAVSIM-v2.
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model cs.CV · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.
Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving cs.CV · 2026-05-20 · unverdicted · none · ref 51 · 2 links · internal anchor
CoPhy is a new RL framework that distills VLM cognition into BEV encoders, adds an auto-regressive BEV world model for action-conditioned future prediction, and optimizes policies via GRPO with dual physical-cognitive rewards, claiming SOTA on NAVSIM v1/v2.
DriveSafer: End-to-End Autonomous Driving with Safety Guidance cs.RO · 2026-05-16 · unverdicted · none · ref 33 · internal anchor
DriveSafer reduces catastrophic failures (PDMS=0) by 48% and drivable-area compliance failures by over 65% versus DiffusionDrive on the NAVSIM benchmark by combining training-time safety constraints with inference-time guidance.
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling cs.RO · 2026-05-13 · unverdicted · none · ref 51 · 2 links · internal anchor
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and PDMS 91.1 on Bench2Drive and NAVSIM.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation cs.AI · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model cs.CV · 2026-04-21 · unverdicted · none · ref 69 · internal anchor
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation cs.CV · 2026-06-02 · unverdicted · none · ref 37 · internal anchor
OmniDreams is a real-time generative world model mid- and post-trained from the Cosmos diffusion model on 21k hours of driving data to autoregressively generate action-conditioned videos for closed-loop AV simulation.
Position: Good Embodied Reward Models Need Bad Behavior Data cs.RO · 2026-05-31 · unverdicted · none · ref 28 · internal anchor
Embodied reward models systematically over-reward unsafe, suboptimal, and shortcut robot behaviors due to training on successful data only, and modest inclusion of bad behavior data improves alignment with human preferences.
PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs cs.RO · 2026-05-26 · unverdicted · none · ref 23 · internal anchor
PEACE decouples single-pass LLM planning from PX4 execution via ROS 2 and a constraint layer, with modular 3D perception, and shows feasibility in Gazebo SITL with improved explainability and fewer LLM calls.
