pith. machine review for the scientific record.

arXiv: 2511.00088 · v2 · pith:DJR6JS25new · submitted 2025-10-30 · 💻 cs.RO · cs.AI · cs.LG

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Pith reviewed 2026-05-18 02:30 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords autonomous driving · vision-language-action model · chain of causation · trajectory planning · reinforcement learning · long-tail scenarios · end-to-end learning · diffusion decoder

The pith

Integrating Chain of Causation reasoning with trajectory planning improves autonomous driving accuracy in rare cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end imitation learning models for autonomous driving perform poorly in safety-critical but uncommon situations because they lack explicit causal understanding and sufficient supervision. This paper introduces Alpamayo-R1, a vision-language-action model that adds structured reasoning about causes and effects in driving scenes to the generation of vehicle trajectories. The approach builds a dataset of causally linked reasoning traces via automated labeling plus human review, then trains the model first with supervised fine-tuning to produce reasoning and later with reinforcement learning to align that reasoning with safe actions. If the integration works, the result is higher planning accuracy and fewer near-collisions in both simulation and real-road tests for long-tail scenarios.

Core claim

Alpamayo-R1 integrates Chain of Causation reasoning with a diffusion-based trajectory decoder inside a modular vision-language-action architecture built on a pre-trained physical AI vision-language model. A hybrid auto-labeling and human-in-the-loop pipeline creates decision-grounded reasoning traces aligned with driving behaviors. Multi-stage training uses supervised fine-tuning to elicit reasoning followed by reinforcement learning to enforce reasoning-action consistency and optimize reasoning quality. This yields up to 12 percent higher planning accuracy on challenging cases, 35 percent lower close-encounter rates in closed-loop simulation, 45 percent better reasoning quality, and 37 percent better reasoning-action consistency.

What carries the argument

The Chain of Causation (CoC) dataset of causally linked reasoning traces together with a modular VLA architecture that pairs a pre-trained vision-language model with a real-time diffusion trajectory decoder.

If this is right

  • Planning accuracy rises by up to 12 percent on challenging cases relative to a trajectory-only baseline.
  • Close encounter rate falls by 35 percent in closed-loop simulation.
  • RL post-training raises reasoning quality by 45 percent and reasoning-action consistency by 37 percent.
  • Performance improves consistently as model size scales from 0.5B to 7B parameters.
  • The system runs at 99 ms latency and succeeds in real urban on-vehicle tests.
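The percentages above are reported without the underlying rates. As a minimal sketch, assuming they are relative changes against the trajectory-only baseline (the abstract does not say whether they are relative or absolute), the arithmetic looks like this. The function name and the example rates are hypothetical and are not taken from the paper.

```python
def relative_change(baseline: float, model: float) -> float:
    """Return the signed relative change of `model` versus `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (model - baseline) / baseline * 100.0


# Hypothetical close-encounter rates, chosen only to illustrate the arithmetic.
baseline_rate = 0.20
model_rate = 0.13
change = relative_change(baseline_rate, model_rate)  # roughly -35, a 35% relative reduction
```

Under this reading, a drop from 20% to 13% of episodes is a 35% relative reduction but only a 7-point absolute one, which is why the baseline rates matter for interpreting the headline numbers.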

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit causal traces could make autonomous vehicle decisions more interpretable for safety audits or regulatory review.
  • The same reasoning-plus-action pattern might transfer to other control tasks such as robotic manipulation where causal understanding is sparse.
  • If the labeling pipeline can be further automated, the method could reduce reliance on ever-larger raw trajectory datasets.
  • Direct comparison against purely end-to-end models on a shared benchmark of novel edge cases would quantify how much the added reasoning step contributes.

Load-bearing premise

The hybrid auto-labeling and human-in-the-loop pipeline produces reasoning traces that accurately reflect real-world causal structure and generalize to long-tail driving scenarios not seen during dataset creation.

What would settle it

No gain or a loss in planning accuracy and no reduction in close encounters when the model is tested on an independent collection of long-tail driving scenarios recorded after the CoC dataset was finalized.
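One way to make such a test decisive is to compare the two models scenario by scenario and report an interval rather than a single number. The following is a minimal sketch of a paired percentile bootstrap over per-scenario errors. It is not the paper's evaluation protocol; the function name, the resample count, and the example values are all hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    model: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for model - baseline."""
    if len(baseline) != len(model) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    # Pair the scores per scenario so scenario difficulty cancels out.
    diffs = [m - b for b, m in zip(baseline, model)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Hypothetical per-scenario planning errors (lower is better).
baseline_err = [1.2, 0.9, 1.5, 1.1, 1.3, 0.8]
model_err = [1.0, 0.9, 1.2, 1.0, 1.1, 0.9]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(baseline_err, model_err)
```

If the interval for the error difference lies entirely below zero on scenarios recorded after the dataset was finalized, the gain is unlikely to be a sampling artifact; an interval that straddles zero would support the "no gain" outcome described above.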

Original abstract

End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Alpamayo-R1 (AR1), a vision-language-action model that couples Chain of Causation reasoning with action prediction to improve generalization in long-tail autonomous driving scenarios. It contributes a new CoC dataset created via hybrid auto-labeling and human-in-the-loop review, a modular architecture pairing the Cosmos-Reason VLM with a diffusion trajectory decoder, and multi-stage training with SFT followed by RL for reasoning-action consistency. Key results include up to 12% better planning accuracy on challenging cases versus a trajectory-only baseline, a 35% lower close encounter rate in closed-loop simulation, RL-driven improvements of 45% in reasoning quality and 37% in reasoning-action consistency, scaling benefits, and real-time on-vehicle deployment at 99 ms latency.

Significance. Should the empirical results and the validity of the CoC reasoning traces be substantiated, this manuscript offers a promising direction for enhancing the robustness and interpretability of end-to-end autonomous driving systems. The integration of explicit causal reasoning with control actions, combined with the public release of weights and code, could facilitate broader adoption and further development in the field of physical AI and robotics.

major comments (2)
  1. Abstract: The abstract reports concrete percentage gains on planning accuracy and close-encounter rate but provides no information on baseline details, statistical tests, data splits, or potential post-hoc selection. This is a load-bearing issue for the central empirical claims as it prevents proper assessment of whether the improvements are robust.
  2. Dataset Construction: The Chain of Causation (CoC) dataset is constructed through a hybrid auto-labeling and human-in-the-loop pipeline, but there is no external validation such as inter-annotator agreement with independent experts or causal intervention tests. This raises a correctness-risk concern that the traces may embed pipeline artifacts rather than independent causal structure, which is critical for the claimed generalization to unseen long-tail scenarios.
minor comments (2)
  1. Training Strategy: The multi-stage training using SFT and RL is described, but more specifics on the RL reward coefficients and how they balance reasoning quality and consistency would clarify the method.
  2. Results: The on-vehicle road tests confirm real-time performance, but additional details on the test scenarios and success criteria would be beneficial for reproducibility.
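The inter-annotator agreement requested in major comment 2 is commonly summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch for two annotators. The label set ("ok"/"bad" for whether a trace's stated cause is judged correct) and the example annotations are hypothetical and are not drawn from the CoC dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if each annotator labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments from two independent reviewers on six reasoning traces.
reviewer_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
reviewer_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this toy case the reviewers agree on five of six traces, but half of that agreement is expected by chance, so kappa comes out near 0.67 rather than 0.83.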

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and detail where appropriate.

Point-by-point responses
  1. Referee: Abstract: The abstract reports concrete percentage gains on planning accuracy and close-encounter rate but provides no information on baseline details, statistical tests, data splits, or potential post-hoc selection. This is a load-bearing issue for the central empirical claims as it prevents proper assessment of whether the improvements are robust.

    Authors: We agree that the abstract would benefit from additional context to support assessment of the reported gains. In the revised manuscript, we will expand the abstract to briefly describe the trajectory-only baseline, note that planning accuracy improvements are evaluated on a held-out test split focused on long-tail scenarios, clarify that close-encounter reductions come from closed-loop simulation averaged across episodes, and state that results follow the primary evaluation protocol with no post-hoc selection. We will also indicate that improvements reflect averages over multiple runs. revision: yes

  2. Referee: Dataset Construction: The Chain of Causation (CoC) dataset is constructed through a hybrid auto-labeling and human-in-the-loop pipeline, but there is no external validation such as inter-annotator agreement with independent experts or causal intervention tests. This raises a correctness-risk concern that the traces may embed pipeline artifacts rather than independent causal structure, which is critical for the claimed generalization to unseen long-tail scenarios.

    Authors: We acknowledge the importance of validating the CoC traces against potential pipeline artifacts. The hybrid construction uses automated generation followed by human-in-the-loop review by domain experts to ground traces in observed driving behaviors and causal links. While we did not compute formal inter-annotator agreement with external experts or conduct separate causal intervention tests, the multi-pass human curation was intended to filter artifacts. We will revise the dataset section to include more details on annotation guidelines, quality control criteria, and example traces. We believe the current process supports the generalization results shown in experiments, though additional external validation remains a valuable direction for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external baselines and simulation

Full rationale

The paper constructs the CoC dataset via a hybrid auto-labeling and human-in-the-loop pipeline, then applies SFT followed by RL that explicitly optimizes defined reasoning-action consistency metrics. Reported gains (12% planning accuracy, 35% close-encounter reduction, 45% reasoning-quality lift) are measured against a separate trajectory-only baseline, in closed-loop simulation, and on-vehicle tests rather than being tautological reproductions of fitted quantities. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central claim appear in the derivation. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the quality of the newly introduced Chain of Causation dataset and on the assumption that RL can enforce consistency without degrading trajectory quality or introducing hidden biases.

free parameters (1)
  • RL reward coefficients for reasoning quality and consistency
    Hyperparameters that balance the two RL objectives are chosen during post-training and directly affect the reported 45% and 37% gains.
axioms (1)
  • domain assumption: Hybrid auto-labeling plus human review can produce causally accurate and decision-grounded reasoning traces for driving scenes
    Invoked in the description of the CoC dataset creation pipeline.
invented entities (1)
  • Chain of Causation (CoC) dataset (no independent evidence)
    purpose: Supply decision-grounded, causally linked reasoning traces aligned with driving behaviors
    New resource constructed for this work; no independent external validation cited in abstract.
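To illustrate why the reward coefficients in the ledger count as a free parameter, here is a minimal sketch that assumes the two RL objectives are combined as a linear weighted sum. That functional form, the [0, 1] score range, and every name and number below are assumptions made for illustration; this page does not state the paper's actual reward or coefficients.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical coefficients for the two RL objectives."""
    quality: float
    consistency: float


def combined_reward(quality_score: float, consistency_score: float,
                    weights: RewardWeights) -> float:
    """Linear weighted sum of two scores, each assumed to lie in [0, 1]."""
    for name, value in (("quality_score", quality_score),
                        ("consistency_score", consistency_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return weights.quality * quality_score + weights.consistency * consistency_score


# Two hypothetical rollouts as (reasoning quality, reasoning-action consistency).
rollout_a = (0.9, 0.4)
rollout_b = (0.6, 0.8)

quality_heavy = RewardWeights(quality=0.8, consistency=0.2)
consistency_heavy = RewardWeights(quality=0.2, consistency=0.8)

# The preferred rollout flips with the choice of weights.
prefers_a = combined_reward(*rollout_a, quality_heavy) > combined_reward(*rollout_b, quality_heavy)
prefers_b = combined_reward(*rollout_b, consistency_heavy) > combined_reward(*rollout_a, consistency_heavy)
```

Under these assumptions the same pair of rollouts is ranked in opposite orders by the two weightings, which is the sense in which the coefficients can shape the reported reasoning-quality and consistency gains.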

pith-pipeline@v0.9.0 · 6024 in / 1402 out tokens · 59657 ms · 2026-05-18T02:30:53.726498+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DAlembert/Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    AR1 achieves up to a 12% improvement in planning accuracy... RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and ele...

  3. Latent Chain-of-Thought World Modeling for End-to-End Driving

    cs.CV 2025-12 unverdicted novelty 7.0

    LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...

  4. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  5. Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

    cs.CV 2026-04 unverdicted novelty 6.0

    Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.

  6. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  7. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  8. Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

    eess.SY 2026-04 unverdicted novelty 6.0

    A pipeline links foundation-model intent reasoning to safe trajectory optimization via behavior sequences and waypoint constraints, achieving over 90% convergence and 1.5x better intent satisfaction in close-proximity tests.

  9. BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.

  10. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  11. Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

    cs.RO 2026-04 unverdicted novelty 6.0

    LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.

  12. Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...

  13. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  14. DRIV-EX: Counterfactual Explanations for Driving LLMs

    cs.CL 2026-02 unverdicted novelty 6.0

    DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD ...

  15. Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

    cs.RO 2026-05 unverdicted novelty 5.0

    CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

  16. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  17. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  18. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 16 Pith papers · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1, 5

  2. [2]

    CoVLA: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025. 5, 9

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 6, 29

  4. [4]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022. 20

  5. [5]

    METEOR: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for mt evaluation with improved correlation with human judgments. InACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005. 16

  6. [6]

    End to End Learning for Self-Driving Cars

    Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End-to-End Learning for Self- Driving Cars.arXiv preprint arXiv:1604.07316, 2016. 1

  7. [7]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 4

  8. [8]

    Impromptu VLA: Open weights and open data for driving vision-language- action models.arXiv preprint arXiv:2505.23757, 2025

    Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu VLA: Open weights and open data for driving vision-language- action models.arXiv preprint arXiv:2505.23757, 2025. 5, 9

  9. [9]

    Language-image models with 3D understanding

    Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, et al. Language-image models with 3D understanding. In International Conference on Learning Representations, 2025. 3

  10. [10]

    Deep rein- forcement learning from human preferences.Advances in Neural Information Processing Systems, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences.Advances in Neural Information Processing Systems, 2017. 3, 4, 20

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 5

  12. [12]

    Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios.arXiv preprint arXiv:2501.04671, 2025

    Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios.arXiv preprint arXiv:2501.04671, 2025. 4

  13. [13]

    Parting with misconceptions about learning-based vehicle motion planning

    Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023. 24 35 Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 6, 20

  15. [15]

    Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

    Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024. 5

  16. [16]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning. PMLR, 2017. 4

  17. [17]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020. 7

  18. [18]

    Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025. 2, 6, 9, 18

  19. [19]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883,

  20. [20]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

    Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset....

  21. [21]

    Baidu Apollo EM Motion Planner

    Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong. Baidu Apollo EM motion planner.arXiv preprint arXiv:1807.08048, 2018. 1

  22. [22]

    CoReVLA: A dual-stage end-to-end autonomous driving framework for long-tail scenarios via collect-and-refine.arXiv preprint arXiv:2509.15968, 2025

    Shiyu Fang, Yiming Cui, Haoyang Liang, Chen Lv, Peng Hang, and Jian Sun. CoReVLA: A dual-stage end-to-end autonomous driving framework for long-tail scenarios via collect-and-refine.arXiv preprint arXiv:2509.15968, 2025. 3

  23. [23]

    DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

    Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Peng Jia, et al. DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025. 4

  24. [24]

    Distilling multi-modal large language models for autonomous driving

    Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Distilling multi-modal large language models for autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27575–27585, 2025. 3

  25. [25]

    DriveAgent: Multi-agent structured reasoning with LLM and multimodal sensor fusion for autonomous driving.arXiv preprint arXiv:2505.02123, 2025

    Xinmeng Hou, Wuqi Wang, Long Yang, Hao Lin, Jinglun Feng, Haigen Min, and Xiangmo Zhao. DriveAgent: Multi-agent structured reasoning with LLM and multimodal sensor fusion for autonomous driving.arXiv preprint arXiv:2505.02123, 2025. 4

  26. [26]

    Planning- oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning- oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023. 1 36 Alpamayo-R1: B...

  27. [27]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024. 2, 3

  28. [28]

    Ishaq, J

    Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. DriveLMM-o1: A step-by-step reasoning datasetandlargemultimodalmodelfordrivingscenariounderstanding.arXivpreprintarXiv:2503.10621,

  29. [29]

    Efficient multi-camera tokenization with triplanes for end-to-end driving.IEEE Robotics and Automation Letters, 10(11):11713–11720, 2025

    Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, and Marco Pavone. Efficient multi-camera tokenization with triplanes for end-to-end driving.IEEE Robotics and Automation Letters, 10(11):11713–11720, 2025. 7, 31, 32

  30. [30]

    Planning with diffusion for flexible behavior synthesis

    Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InInternational Conference on Machine Learning, pages 9902–9915, 2022. 18

  31. [31]

    Bench2Drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024. 3

  32. [32]

    IRL-VLA: Training an vision-language-action policy via reward world model

    Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. IRL-VLA: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025. 2, 3

  33. [33]

    VAD: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350,

  34. [34]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024. 5

  35. [35]

    MotionDiffuser: Controllable multi-agent motion prediction using diffusion

    Chiyu Jiang, Andre Cornman, Cheolho Park, Benjamin Sapp, Yin Zhou, Dragomir Anguelov, et al. MotionDiffuser: Controllable multi-agent motion prediction using diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9644–9653, 2023. 18

  36. [36]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 2

  37. [37]

    SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

    Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, and Zhijian Liu. SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference. In ICCV, 2025. 8

  38. [38]

    Textual explanations for self-driving vehicles

    Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In European Conference on Computer Vision, pages 563–578, 2018. 5

  39. [39]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 22

  40. [40]

    RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In International Conference on Machine Learning,

  42. [42]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023. 4

  43. [43]

    A survey on motion prediction and risk assessment for intelligent vehicles

    Sébastien Lefèvre, David Vasquez, and Christian Laugier. A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal, 1(1):1–14, 2014. 1

  44. [44]

    Wolf: Dense video captioning with a world summarization framework

    Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, et al. Wolf: Dense video captioning with a world summarization framework. Transactions on Machine Learning Research, 2025. 3

  45. [45]

    WOMD-Reasoning: A large-scale dataset for interaction reasoning in driving

    Yiheng Li, Cunxin Fan, Chongjian Ge, Zhihao Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, et al. WOMD-Reasoning: A large-scale dataset for interaction reasoning in driving. arXiv preprint arXiv:2407.04281, 2024. 4

  46. [46]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025. 4

  47. [47]

    Drive-R1: Bridging reasoning and planning in VLMs for autonomous driving with reinforcement learning

    Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, and Xinhai Zhao. Drive-R1: Bridging reasoning and planning in VLMs for autonomous driving with reinforcement learning. arXiv preprint arXiv:2506.18234, 2025. 4

  48. [48]

    CoT-Drive: Efficient motion forecasting for autonomous driving with LLMs and chain-of-thought prompting

    Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, and Zhenning Li. CoT-Drive: Efficient motion forecasting for autonomous driving with LLMs and chain-of-thought prompting. IEEE Transactions on Artificial Intelligence, 2025. 4

  49. [49]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. 2, 9, 18

  50. [50]

    DSDrive: Distilling large language model for lightweight end-to-end autonomous driving with unified reasoning and planning

    Wenru Liu, Pei Liu, and Jun Ma. DSDrive: Distilling large language model for lightweight end-to-end autonomous driving with unified reasoning and planning. arXiv preprint arXiv:2505.05360, 2025. 4

  51. [51]

    ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving

    Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, et al. ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. arXiv preprint arXiv:2505.20024, 2025. 4

  52. [52]

    Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios

    Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 7553–7560, 2023. 4

  53. [53]

    Real-ad: Towards human-like reasoning in end-to-end autonomous driving

    Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025. 3

  54. [54]

    AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving

    Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025. 2, 3

  55. [55]

    MTRDrive: Memory-tool synergistic reasoning for robust autonomous driving in corner cases

    Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, et al. MTRDrive: Memory-tool synergistic reasoning for robust autonomous driving in corner cases. arXiv preprint arXiv:2509.20843, 2025. 4

  56. [56]

    Modern Robotics

    Kevin M Lynch and Frank C Park. Modern Robotics. Cambridge University Press, 2017. 9

  57. [57]

    DRAMA: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. DRAMA: Joint risk localization and captioning in driving. Winter Conference on Applications of Computer Vision, 2023. 4, 9

  58. [58]

    GPT-Driver: Learning to Drive with GPT

    Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023. 2, 3

  59. [59]

    A language agent for autonomous driving

    Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving. In Conference on Language Modeling, 2024. 2, 3

  60. [60]

    LingoQA: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. LingoQA: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269, 2024. 5, 30, 31

  61. [61]

    Rule based rewards for language model safety

    Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety. Advances in Neural Information Processing Systems, 2024. 4

  62. [62]

    Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308, 2024. 3, 5

  63. [63]

    Cosmos-RL: A flexible and scalable reinforcement learning framework

    NVIDIA. Cosmos-RL: A flexible and scalable reinforcement learning framework. https://nvidia-cosmos.github.io/cosmos-rl/, 2025. 22

  64. [64]

    Physical AI autonomous vehicles dataset

    NVIDIA. Physical AI autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, October 2025. 28, 29, 33

  65. [65]

    Physical AI autonomous vehicles NuRec dataset

    NVIDIA. Physical AI autonomous vehicles NuRec dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec, October 2025. 29

  66. [66]

    NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development

    NVIDIA. NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development. 2026. URL https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development. 1

  67. [67]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    NVIDIA, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo L...

  68. [68]

    AlpaSim: A modular, lightweight, and data-driven research simulator for autonomous driving

    NVIDIA, Yulong Cao, Riccardo de Lutio, Sanja Fidler, Guillermo Garcia Cobo, Zan Gojcic, Maximilian Igl, Boris Ivanovic, Peter Karkus, Janick Martinez Esturo, Marco Pavone, Aaron Smith, Ellie Tanimura, Michal Tyszkiewicz, Michael Watson, Qi Wu, and Le Zhang. AlpaSim: A modular, lightweight, and data-driven research simulator for autonomous driving, October...

  69. [69]

    Learning to reason with LLMs

    OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/. 2, 4, 32

  70. [70]

    GPT-5 system card

    OpenAI. GPT-5 system card. https://openai.com/index/gpt-5-system-card/, 2025. 15, 16

  71. [71]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023. 29

  72. [72]

    A survey of motion planning and control techniques for self-driving urban vehicles

    Brian Paden, Michal Čáp, Sze Zheng Yong, Dmitry Yershov, and Emilio Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles, 1(1):33–55, 2016. 1

  73. [73]

    BLEU: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics, pages 311–318, 2002. 16

  74. [74]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 𝜋0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 18

  75. [75]

    AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving

    Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.15298,

  76. [76]

    NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024. 4

  77. [77]

    Qwen2.5: A party of foundation models

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/. 29

  78. [78]

    Qwen3-VL: Sharper vision, deeper thought, broader action

    Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list,

  79. [79]

    SimLingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025. 2, 3

  80. [80]

    Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving

    Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234, 2025. 2, 3, 4

Showing first 80 references.