arxiv: 2511.00088 · v2 · pith:DJR6JS25new · submitted 2025-10-30 · 💻 cs.RO · cs.AI · cs.LG

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond

Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed Schmerling, Shida Shen, Yunfei Shi, Sarah Tariq, Ran Tian, Tilman Wekel, Xinshuo Weng, Tianjun Xiao, Eric Yang, Xiaodong Yang, Yurong You, Xiaohui Zeng, Wenyuan Zhang, Boris Ivanovic, Marco Pavone

Pith reviewed 2026-05-18 02:30 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords autonomous driving · vision-language-action model · chain of causation · trajectory planning · reinforcement learning · long-tail scenarios · end-to-end learning · diffusion decoder

The pith

Integrating Chain of Causation reasoning with trajectory planning improves autonomous driving accuracy in rare cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end imitation learning models for autonomous driving perform poorly in safety-critical but uncommon situations because they lack explicit causal understanding and sufficient supervision. This paper introduces Alpamayo-R1, a vision-language-action model that adds structured reasoning about causes and effects in driving scenes to the generation of vehicle trajectories. The approach builds a dataset of causally linked reasoning traces via automated labeling plus human review, then trains the model first with supervised fine-tuning to produce reasoning and later with reinforcement learning to align that reasoning with safe actions. If the integration works, the result is higher planning accuracy and fewer near-collisions in both simulation and real-road tests for long-tail scenarios.

Core claim

Alpamayo-R1 integrates Chain of Causation reasoning with a diffusion-based trajectory decoder inside a modular vision-language-action architecture built on a pre-trained physical AI vision-language model. A hybrid auto-labeling and human-in-the-loop pipeline creates decision-grounded reasoning traces aligned with driving behaviors. Multi-stage training uses supervised fine-tuning to elicit reasoning, followed by reinforcement learning to enforce reasoning-action consistency and optimize reasoning quality. This yields up to 12 percent higher planning accuracy on challenging cases relative to a trajectory-only baseline, a 35 percent lower close-encounter rate in closed-loop simulation, and, after RL post-training, 45 percent better reasoning quality and 37 percent better reasoning-action consistency.

What carries the argument

The Chain of Causation (CoC) dataset of causally linked reasoning traces together with a modular VLA architecture that pairs a pre-trained vision-language model with a real-time diffusion trajectory decoder.
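To make the modular idea concrete, here is a minimal sketch of a two-stage planner in which a reasoning stage runs first and a trajectory stage is conditioned on its output. Everything in it is an illustrative assumption: `Plan`, `make_planner`, `toy_reason`, and `toy_decode` are hypothetical names, and the toy functions stand in for a vision-language model and a learned decoder. It is not the paper's implementation and contains no diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres, ego frame (assumed convention)


@dataclass
class Plan:
    reasoning: str              # text trace produced by the reasoning stage
    trajectory: List[Waypoint]  # waypoints produced by the trajectory stage


def make_planner(
    reason: Callable[[str], str],
    decode: Callable[[str], List[Waypoint]],
) -> Callable[[str], Plan]:
    """Compose a reasoning stage and a trajectory stage into one planner."""
    def plan(scene: str) -> Plan:
        trace = reason(scene)               # stage 1: explain the decision
        return Plan(trace, decode(trace))   # stage 2: condition on the trace
    return plan


# Hypothetical stand-ins; a real system would call a VLM and a learned decoder.
def toy_reason(scene: str) -> str:
    if "pedestrian" in scene:
        return "stop: pedestrian ahead"
    return "keep lane: road clear"


def toy_decode(trace: str) -> List[Waypoint]:
    step = 0.0 if trace.startswith("stop") else 2.0
    return [(step * i, 0.0) for i in range(1, 4)]


planner = make_planner(toy_reason, toy_decode)
result = planner("pedestrian crossing at 12 m")
```

Because the two stages only meet through the trace, either one can be swapped or evaluated on its own, which is the property a modular design is meant to buy.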

If this is right

Planning accuracy rises by up to 12 percent on challenging cases relative to a trajectory-only baseline.
Close encounter rate falls by 35 percent in closed-loop simulation.
RL post-training raises reasoning quality by 45 percent and reasoning-action consistency by 37 percent.
Performance improves consistently as model size scales from 0.5B to 7B parameters.
The system runs at 99 ms latency and succeeds in real urban on-vehicle tests.
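The percentages above are easiest to read as relative changes against a baseline, although the abstract does not spell out whether they are relative or absolute. The sketch below shows the arithmetic under the relative reading. The function name `relative_change` and both rates are made up for illustration and are not taken from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Hypothetical numbers chosen only to illustrate the arithmetic.
baseline_rate = 0.20  # close encounters per episode for a baseline (made up)
model_rate = 0.13     # same metric for a reasoning model (made up)

# A drop from 0.20 to 0.13 is a relative reduction of roughly 0.35.
reduction = -relative_change(baseline_rate, model_rate)
```

Under this reading, a "35 percent reduction" says nothing about the absolute rate: the same figure would describe a drop from 0.02 to 0.013.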

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit causal traces could make autonomous vehicle decisions more interpretable for safety audits or regulatory review.
The same reasoning-plus-action pattern might transfer to other control tasks such as robotic manipulation where causal understanding is sparse.
If the labeling pipeline can be further automated, the method could reduce reliance on ever-larger raw trajectory datasets.
Direct comparison against purely end-to-end models on a shared benchmark of novel edge cases would quantify how much the added reasoning step contributes.

Load-bearing premise

The hybrid auto-labeling and human-in-the-loop pipeline produces reasoning traces that accurately reflect real-world causal structure and generalize to long-tail driving scenarios not seen during dataset creation.

What would settle it

No gain or a loss in planning accuracy and no reduction in close encounters when the model is tested on an independent collection of long-tail driving scenarios recorded after the CoC dataset was finalized.
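One way such a check could be scored is a paired bootstrap over per-scenario results, where each held-out scenario is evaluated once with the baseline and once with the reasoning model. The sketch below is an assumption about how a reader might run that comparison, not a procedure described in the paper. The function name `paired_bootstrap_ci` and all scores are hypothetical, and the scores are assumed to be higher-is-better.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    model: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for mean(model - baseline) over paired scenarios."""
    if len(baseline) != len(model) or len(baseline) == 0:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [m - b for b, m in zip(baseline, model)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-scenario planning scores, for illustration only.
baseline_scores = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59]
model_scores = [0.64, 0.60, 0.69, 0.55, 0.71, 0.62]
low, high = paired_bootstrap_ci(baseline_scores, model_scores)
```

An interval that contains zero, or lies entirely below it, would correspond to the "no gain or a loss" outcome described above. An interval entirely above zero would not settle the causal question, but it would show the gain survives on scenarios the labeling pipeline never saw.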

Original abstract

End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AR1 shows concrete gains from adding reasoning traces to driving models but the causal claims hinge on unvalidated dataset labels.

The main point is that Alpamayo-R1 reports measurable lifts in planning accuracy and fewer close encounters by inserting Chain of Causation reasoning before trajectory generation, yet those lifts rest on the assumption that the hybrid-labeled traces reflect real causal structure rather than the labeling pipeline itself. If that assumption slips, the gains look more like refined imitation than new reasoning capability.

What is actually new is the CoC dataset construction via auto-labeling plus human review, the modular setup that pairs Cosmos-Reason with a diffusion trajectory head, and the RL stage that directly optimizes reasoning quality and action consistency after supervised fine-tuning. They also release weights and inference code, which is useful.

The paper does a few things cleanly. It shows scaling benefits from 0.5B to 7B parameters, real-time latency under 100 ms, and on-vehicle urban runs that back the simulation numbers. The RL objective targets explicit consistency metrics instead of just replaying fitted outputs, so that part avoids obvious circularity.

The soft spots sit mainly in the dataset and evaluation details. The abstract gives percentage improvements without baseline descriptions, split information, or significance tests, and there is no external check such as inter-annotator agreement with independent experts or causal intervention experiments on the traces. If the labels largely encode the auto-labeler’s heuristics, then the observed 12% planning and 35% encounter gains could be explained by better copying of that policy rather than genuine causal understanding. This is the least secure link.

The work is aimed at groups building vision-language-action systems for robotics or autonomous driving who want a practical recipe for adding interpretable steps. Readers who care about long-tail robustness and open artifacts will find concrete material to test. It deserves a serious referee because the problem is central, the artifacts are public, and the empirical results are specific enough to evaluate even if the causal interpretation needs tighter evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces Alpamayo-R1 (AR1), a vision-language-action model that bridges Chain of Causation reasoning with action prediction to improve generalization in long-tail autonomous driving scenarios. It features a new CoC dataset created via hybrid auto-labeling and human-in-the-loop review, a modular architecture pairing the Cosmos-Reason VLM with a diffusion trajectory decoder, and multi-stage training with SFT and RL for consistency. Key results include up to 12% better planning accuracy on challenging cases versus a trajectory-only baseline, a 35% lower close-encounter rate in closed-loop simulation, improvements of 45% in reasoning quality and 37% in reasoning-action consistency from RL post-training, scaling benefits from 0.5B to 7B parameters, and real-time on-vehicle deployment at 99 ms latency.

Significance. Should the empirical results and the validity of the CoC reasoning traces be substantiated, this manuscript offers a promising direction for enhancing the robustness and interpretability of end-to-end autonomous driving systems. The integration of explicit causal reasoning with control actions, combined with the public release of weights and code, could facilitate broader adoption and further development in the field of physical AI and robotics.

major comments (2)

Abstract: The abstract reports concrete percentage gains on planning accuracy and close-encounter rate but provides no information on baseline details, statistical tests, data splits, or potential post-hoc selection. This is a load-bearing issue for the central empirical claims as it prevents proper assessment of whether the improvements are robust.
Dataset Construction: The Chain of Causation (CoC) dataset is constructed through a hybrid auto-labeling and human-in-the-loop pipeline, but there is no external validation such as inter-annotator agreement with independent experts or causal intervention tests. This raises a correctness-risk concern that the traces may embed pipeline artifacts rather than independent causal structure, which is critical for the claimed generalization to unseen long-tail scenarios.
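The inter-annotator agreement check requested above could be as simple as Cohen's kappa between the pipeline's decision label and an independent expert's label on the same clips. The sketch below is a generic implementation of that statistic. The label names and the two label lists are hypothetical, and the paper does not report such a measurement.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labelled independently
    # with their own marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical decision labels for four clips.
pipeline_labels = ["yield", "yield", "stop", "stop"]
expert_labels = ["yield", "stop", "stop", "stop"]
kappa = cohens_kappa(pipeline_labels, expert_labels)  # 0.5 for these lists
```

Here the annotators agree on three of four clips (0.75) while chance agreement is 0.5, so kappa is 0.5. A value near zero on real data would support the concern that the traces encode pipeline habits rather than independently recognizable causes.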

minor comments (2)

Training Strategy: The multi-stage training using SFT and RL is described, but more specifics on the RL reward coefficients and how they balance reasoning quality and consistency would clarify the method.
Results: The on-vehicle road tests confirm real-time performance, but additional details on the test scenarios and success criteria would be beneficial for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and detail where appropriate.

Referee: Abstract: The abstract reports concrete percentage gains on planning accuracy and close-encounter rate but provides no information on baseline details, statistical tests, data splits, or potential post-hoc selection. This is a load-bearing issue for the central empirical claims as it prevents proper assessment of whether the improvements are robust.

Authors: We agree that the abstract would benefit from additional context to support assessment of the reported gains. In the revised manuscript, we will expand the abstract to briefly describe the trajectory-only baseline, note that planning accuracy improvements are evaluated on a held-out test split focused on long-tail scenarios, clarify that close-encounter reductions come from closed-loop simulation averaged across episodes, and state that results follow the primary evaluation protocol with no post-hoc selection. We will also indicate that improvements reflect averages over multiple runs.
revision: yes

Referee: Dataset Construction: The Chain of Causation (CoC) dataset is constructed through a hybrid auto-labeling and human-in-the-loop pipeline, but there is no external validation such as inter-annotator agreement with independent experts or causal intervention tests. This raises a correctness-risk concern that the traces may embed pipeline artifacts rather than independent causal structure, which is critical for the claimed generalization to unseen long-tail scenarios.

Authors: We acknowledge the importance of validating the CoC traces against potential pipeline artifacts. The hybrid construction uses automated generation followed by human-in-the-loop review by domain experts to ground traces in observed driving behaviors and causal links. While we did not compute formal inter-annotator agreement with external experts or conduct separate causal intervention tests, the multi-pass human curation was intended to filter artifacts. We will revise the dataset section to include more details on annotation guidelines, quality control criteria, and example traces. We believe the current process supports the generalization results shown in experiments, though additional external validation remains a valuable direction for future work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external baselines and simulation

The paper constructs the CoC dataset via a hybrid auto-labeling and human-in-the-loop pipeline, then applies SFT followed by RL that explicitly optimizes defined reasoning-action consistency metrics. Reported gains (12% planning accuracy, 35% close-encounter reduction, 45% reasoning-quality lift) are measured against a separate trajectory-only baseline, in closed-loop simulation, and on-vehicle tests rather than being tautological reproductions of fitted quantities. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central claim appear in the derivation. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the quality of the newly introduced Chain of Causation dataset and on the assumption that RL can enforce consistency without degrading trajectory quality or introducing hidden biases.

free parameters (1)

RL reward coefficients for reasoning quality and consistency
Hyperparameters that balance the two RL objectives are chosen during post-training and directly affect the reported 45% and 37% gains.
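To see why these coefficients matter, consider a weighted sum of a reasoning-quality score and a consistency score. The weighted-sum form, the function name `combined_reward`, and all numbers below are assumptions made for illustration; the abstract does not state how the two objectives are combined.

```python
def combined_reward(
    quality: float,
    consistency: float,
    w_quality: float = 0.5,
    w_consistency: float = 0.5,
) -> float:
    """Weighted sum of two scores, each assumed to lie in [0, 1]."""
    for name, value in (("quality", quality), ("consistency", consistency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return w_quality * quality + w_consistency * consistency


# Two hypothetical rollouts: A reasons well but acts inconsistently,
# B reasons less well but its action matches its reasoning.
rollout_a = {"quality": 0.9, "consistency": 0.2}
rollout_b = {"quality": 0.5, "consistency": 0.8}

# Quality-heavy weights prefer A; consistency-heavy weights prefer B.
prefers_a = combined_reward(**rollout_a, w_quality=0.8, w_consistency=0.2) > \
    combined_reward(**rollout_b, w_quality=0.8, w_consistency=0.2)
prefers_b = combined_reward(**rollout_b, w_quality=0.2, w_consistency=0.8) > \
    combined_reward(**rollout_a, w_quality=0.2, w_consistency=0.8)
```

The same pair of rollouts is ranked in opposite order under the two weightings, which is why the chosen coefficients sit in the free-parameter ledger: they can move the reported reasoning-quality and consistency gains in different directions.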

axioms (1)

domain assumption: Hybrid auto-labeling plus human review can produce causally accurate and decision-grounded reasoning traces for driving scenes
Invoked in the description of the CoC dataset creation pipeline.

invented entities (1)

Chain of Causation (CoC) dataset (no independent evidence)
purpose: Supply decision-grounded, causally linked reasoning traces aligned with driving behaviors
New resource constructed for this work; no independent external validation cited in abstract.

pith-pipeline@v0.9.0 · 6024 in / 1402 out tokens · 59657 ms · 2026-05-18T02:30:53.726498+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DAlembert/Inevitability · bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "AR1 achieves up to a 12% improvement in planning accuracy... RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO · 2026-05 · unverdicted · novelty 7.0
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning
cs.RO · 2026-04 · unverdicted · novelty 7.0
A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and ele...

Latent Chain-of-Thought World Modeling for End-to-End Driving
cs.CV · 2025-12 · unverdicted · novelty 7.0
LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO · 2026-05 · unverdicted · novelty 6.0
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
cs.CV · 2026-04 · unverdicted · novelty 6.0
Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV · 2026-04 · unverdicted · novelty 6.0
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV · 2026-04 · unverdicted · novelty 6.0
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models
eess.SY · 2026-04 · unverdicted · novelty 6.0
A pipeline links foundation-model intent reasoning to safe trajectory optimization via behavior sequences and waypoint constraints, achieving over 90% convergence and 1.5x better intent satisfaction in close-proximity tests.

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
cs.RO · 2026-04 · unverdicted · novelty 6.0
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.

Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
cs.CV · 2026-04 · unverdicted · novelty 6.0
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
cs.RO · 2026-04 · unverdicted · novelty 6.0
LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving
cs.RO · 2026-04 · unverdicted · novelty 6.0
Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...

ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
cs.CV · 2026-04 · unverdicted · novelty 6.0
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

DRIV-EX: Counterfactual Explanations for Driving LLMs
cs.CL · 2026-02 · unverdicted · novelty 6.0
DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD ...

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
cs.RO · 2026-05 · unverdicted · novelty 5.0
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
cs.AI · 2026-05 · unverdicted · novelty 5.0
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
cs.RO · 2026-05 · unverdicted · novelty 5.0
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV · 2026-04 · unverdicted · novelty 5.0
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 16 Pith papers · 20 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

CoVLA: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025. 5, 9

work page 1933
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 6, 29

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022. 20

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

METEOR: An automatic metric for mt evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for mt evaluation with improved correlation with human judgments. InACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005. 16

work page 2005
[6]

End to End Learning for Self-Driving Cars

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End-to-End Learning for Self- Driving Cars.arXiv preprint arXiv:1604.07316, 2016. 1

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

nuScenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 4

work page 2020
[8]

Impromptu VLA: Open weights and open data for driving vision-language- action models.arXiv preprint arXiv:2505.23757, 2025

Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu VLA: Open weights and open data for driving vision-language- action models.arXiv preprint arXiv:2505.23757, 2025. 5, 9

work page arXiv 2025
[9]

Language-image models with 3D understanding

Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, et al. Language-image models with 3D understanding. In International Conference on Learning Representations, 2025. 3

work page 2025
[10]

Deep rein- forcement learning from human preferences.Advances in Neural Information Processing Systems, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences.Advances in Neural Information Processing Systems, 2017. 3, 4, 20

work page 2017
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios.arXiv preprint arXiv:2501.04671, 2025

Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios.arXiv preprint arXiv:2501.04671, 2025. 4

work page arXiv 2025
[13]

Parting with misconceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023. 24 35 Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

work page 2023
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 6, 20

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024. 5

work page 2024
[16]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning. PMLR, 2017. 4

work page 2017
[17]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020. 7

work page 2020
[18]

Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025. 2, 6, 9, 18

work page arXiv 2025
[19]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883,

work page
[20]

Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset....

work page 2021
[21]

Baidu Apollo EM Motion Planner

Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong. Baidu Apollo EM motion planner.arXiv preprint arXiv:1807.08048, 2018. 1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

CoReVLA: A dual-stage end-to-end autonomous driving framework for long-tail scenarios via collect-and-refine.arXiv preprint arXiv:2509.15968, 2025

Shiyu Fang, Yiming Cui, Haoyang Liang, Chen Lv, Peng Hang, and Jian Sun. CoReVLA: A dual-stage end-to-end autonomous driving framework for long-tail scenarios via collect-and-refine.arXiv preprint arXiv:2509.15968, 2025. 3

work page arXiv 2025
[23]

DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Peng Jia, et al. DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025. 4

work page arXiv 2025
[24]

Distilling multi-modal large language models for autonomous driving

Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Distilling multi-modal large language models for autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27575–27585, 2025. 3

[25]

DriveAgent: Multi-agent structured reasoning with LLM and multimodal sensor fusion for autonomous driving

Xinmeng Hou, Wuqi Wang, Long Yang, Hao Lin, Jinglun Feng, Haigen Min, and Xiangmo Zhao. DriveAgent: Multi-agent structured reasoning with LLM and multimodal sensor fusion for autonomous driving. arXiv preprint arXiv:2505.02123, 2025. 4

[26]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023. 1

[27]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024. 2, 3

[28]

DriveLMM-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding

Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. DriveLMM-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. arXiv preprint arXiv:2503.10621,

[29]

Efficient multi-camera tokenization with triplanes for end-to-end driving

Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, and Marco Pavone. Efficient multi-camera tokenization with triplanes for end-to-end driving. IEEE Robotics and Automation Letters, 10(11):11713–11720, 2025. 7, 31, 32

[30]

Planning with diffusion for flexible behavior synthesis

Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915, 2022. 18

[31]

Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 37:819–844, 2024. 3

[32]

IRL-VLA: Training an vision-language-action policy via reward world model

Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. IRL-VLA: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025. 2, 3

[33]

VAD: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350,

[34]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024. 5

[35]

MotionDiffuser: Controllable multi-agent motion prediction using diffusion

Chiyu Jiang, Andre Cornman, Cheolho Park, Benjamin Sapp, Yin Zhou, Dragomir Anguelov, et al. MotionDiffuser: Controllable multi-agent motion prediction using diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9644–9653, 2023. 18

[36]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 2

[37]

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, and Zhijian Liu. SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference. In ICCV, 2025. 8

[38]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In European Conference on Computer Vision, pages 563–578, 2018. 5

[39]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 22

[40]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In International Conference on Machine Learning,

[42]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023. 4

[43]

A survey on motion prediction and risk assessment for intelligent vehicles

Sébastien Lefèvre, David Vasquez, and Christian Laugier. A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal, 1(1):1–14, 2014. 1

[44]

Wolf: Dense video captioning with a world summarization framework

Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, et al. Wolf: Dense video captioning with a world summarization framework. Transactions on Machine Learning Research, 2025. 3

[45]

WOMD-Reasoning: A large-scale dataset for interaction reasoning in driving

Yiheng Li, Cunxin Fan, Chongjian Ge, Zhihao Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, et al. WOMD-Reasoning: A large-scale dataset for interaction reasoning in driving. arXiv preprint arXiv:2407.04281, 2024. 4

[46]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025. 4

[47]

Drive-R1: Bridging reasoning and planning in VLMs for autonomous driving with reinforcement learning

Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, and Xinhai Zhao. Drive-R1: Bridging reasoning and planning in VLMs for autonomous driving with reinforcement learning. arXiv preprint arXiv:2506.18234, 2025. 4

[48]

CoT-Drive: Efficient motion forecasting for autonomous driving with LLMs and chain-of-thought prompting

Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, and Zhenning Li. CoT-Drive: Efficient motion forecasting for autonomous driving with LLMs and chain-of-thought prompting. IEEE Transactions on Artificial Intelligence, 2025. 4

[49]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. 2, 9, 18

[50]

DSDrive: Distilling large language model for lightweight end-to-end autonomous driving with unified reasoning and planning

Wenru Liu, Pei Liu, and Jun Ma. DSDrive: Distilling large language model for lightweight end-to-end autonomous driving with unified reasoning and planning. arXiv preprint arXiv:2505.05360, 2025. 4

[51]

ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving

Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, et al. ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. arXiv preprint arXiv:2505.20024, 2025. 4

[52]

Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios

Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 7553–7560, 2023. 4

[53]

Real-ad: Towards human-like reasoning in end-to-end autonomous driving

Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025. 3

[54]

AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving

Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025. 2, 3

[55]

MTRDrive: Memory-tool synergistic reasoning for robust autonomous driving in corner cases

Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, et al. MTRDrive: Memory-tool synergistic reasoning for robust autonomous driving in corner cases. arXiv preprint arXiv:2509.20843, 2025. 4

[56]

Modern Robotics

Kevin M Lynch and Frank C Park. Modern Robotics. Cambridge University Press, 2017. 9

[57]

DRAMA: Joint risk localization and captioning in driving

Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. DRAMA: Joint risk localization and captioning in driving. Winter Conference on Applications of Computer Vision, 2023. 4, 9

[58]

GPT-Driver: Learning to Drive with GPT

Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023. 2, 3

[59]

A language agent for autonomous driving

Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving. In Conference on Language Modeling, 2024. 2, 3

[60]

LingoQA: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. LingoQA: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269, 2024. 5, 30, 31

[61]

Rule based rewards for language model safety

Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety. Advances in Neural Information Processing Systems, 2024. 4

[62]

Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308, 2024. 3, 5

[63]

Cosmos-RL: A flexible and scalable reinforcement learning framework

NVIDIA. Cosmos-RL: A flexible and scalable reinforcement learning framework. https://nvidia-cosmos.github.io/cosmos-rl/, 2025. 22

[64]

Physical AI autonomous vehicles dataset

NVIDIA. Physical AI autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, October 2025. 28, 29, 33

[65]

Physical AI autonomous vehicles NuRec dataset

NVIDIA. Physical AI autonomous vehicles NuRec dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec, October 2025. 29

[66]

NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development

NVIDIA. NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development. 2026. URL https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development. 1

[67]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

AlpaSim: A modular, lightweight, and data-driven research simulator for autonomous driving, October 2025

NVIDIA, Yulong Cao, Riccardo de Lutio, Sanja Fidler, Guillermo Garcia Cobo, Zan Gojcic, Maximilian Igl, Boris Ivanovic, Peter Karkus, Janick Martinez Esturo, Marco Pavone, Aaron Smith, Ellie Tanimura, Michal Tyszkiewicz, Michael Watson, Qi Wu, and Le Zhang. AlpaSim: A modular, lightweight, and data-driven research simulator for autonomous driving, October...

[69]

Learning to reason with LLMs

OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/. 2, 4, 32

[70]

GPT-5 system card

OpenAI. GPT-5 system card. https://openai.com/index/gpt-5-system-card/, 2025. 15, 16

[71]

DINOv2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023. 29

[72]

A survey of motion planning and control techniques for self-driving urban vehicles

Brian Paden, Michal Čáp, Sze Zheng Yong, Dmitry Yershov, and Emilio Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles, 1(1):33–55, 2016. 1

[73]

BLEU: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics, pages 311–318, 2002. 16

[74]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 𝜋0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 18

[75]

AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving

Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.15298,

[76]

NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024. 4

[77]

Qwen2.5: A party of foundation models

Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/. 29

[78]

Qwen3-VL: Sharper vision, deeper thought, broader action

Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list,

[79]

SimLingo: Vision-only closed-loop autonomous driving with language-action alignment

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025. 2, 3

[80]

Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving

Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234, 2025. 2, 3, 4


Showing first 80 references.