pith. sign in

arxiv: 2605.21139 · v1 · pith:2F7C5IPDnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

Pith reviewed 2026-05-21 04:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords autonomous drivingreinforcement learningBEV world modelvision-language distillationdual rewardGRPOintent controlNAVSIM benchmark
0
0 comments X

The pith

Cognitive-physical RL for driving distills VLM knowledge into BEV encoder and adds action-conditioned future prediction for safer policies

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end autonomous driving models are limited by imitation learning ceilings, and the paper seeks to overcome this by supplying reinforcement learning with missing cognitive understanding of traffic and intent plus physical foresight into action consequences. It distills vision-language model knowledge into a bird's-eye-view encoder that retains semantic capability at zero runtime cost after the model is discarded, then constructs an auto-regressive BEV world model that generates future semantic maps from candidate actions to serve as a sandbox for safety evaluation. The resulting policy is trained with GRPO using a dual-reward setup in which physical scores from world-model rollouts enforce hard constraints and cognitive scores from a language-aligned scorer enforce intent compliance, producing better results on standard benchmarks along with the ability to accept user language commands.

Core claim

The central claim is that a cognitive-physical reinforcement learning framework called CoPhy advances autonomous driving by first distilling VLM knowledge into the BEV encoder and discarding the VLM to keep cognitive ability at zero inference cost while exposing a language interface, second building an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions to derive interpretable safety metrics, and third optimizing the driving policy via GRPO with a dual-reward mechanism in which physical rewards from BEV rollouts enforce hard safety constraints and cognitive rewards ensure intent compliance, yielding state-of-the-art performance on NA

What carries the argument

The dual infrastructure of a distilled cognitive BEV encoder that retains VLM semantics at zero cost and an auto-regressive BEV world model that predicts future semantic maps from candidate actions to supply physical safety metrics for dual-reward GRPO

If this is right

  • The method achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks.
  • Safer driving results from cognitively informed scene compliance enforced by the physical reward.
  • Flexible intent control becomes possible through user-defined language instructions via the cognitive channel.
  • The physical reward derived from BEV rollouts directly enforces hard safety constraints during optimization.
  • The cognitive reward from the language-aligned scorer maintains compliance with driving intent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the world-model predictions remain reliable across diverse weather and traffic densities, the approach could lower the volume of real-world miles needed for validation.
  • The pluggable language interface could support regional or personal driving style preferences without retraining the core policy.
  • Extending the same distillation step to other perception modules might cut inference costs in broader robotics applications.
  • Pairing the dual-reward structure with multi-agent world models could address cooperative behaviors in dense traffic.

Load-bearing premise

The auto-regressive BEV world model produces future semantic maps accurate enough that safety metrics computed from its rollouts can be treated as reliable hard constraints.

What would settle it

Direct tests showing that collision or violation rates predicted by the BEV world model rollouts do not match observed outcomes in the NAVSIM simulator or real-world driving data would falsify the reliability of the physical safety constraints.

Figures

Figures reproduced from arXiv: 2605.21139 by Jian Yang, Jin Xie, Qiang Meng, Yang Wu, Youquan Liu, Zhaojiang Liu.

Figure 1
Figure 1. Figure 1: (a) Previous methods isolate cognitive and physical reasoning, leading to semantic failures like ignoring a STOP sign or spatial violations like halting on a crosswalk. (b) In contrast, CoPhy respects traffic semantics and maintains lane discipline, ensuring safe and cognitive-aligned driving. collisions and lane violations. Similar to human drivers who mentally simulate outcomes before acting, an autonomo… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of CoPhy. Multi-modal data are encoded into BEV state [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hierarchical trajectory selection. Candidates passing the cognitive threshold [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons of trajectories before and after optimization. The dual-reward [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparisons with DiffusionDrive [28] and WoTE [23]. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: t-SNE [49] of distillation. Human command: I‘m in a very urgent situation, accelerate and run the red light! Original trajectory Human-intent trajectory Human command: Stop following slowly, do not tailgate. Change lanes to an open lane [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes CoPhy, a Cognitive-Physical RL framework for autonomous driving. It distills VLM knowledge into a BEV encoder (then discards the VLM) to retain cognitive understanding of traffic semantics and intent at zero inference cost, while constructing an auto-regressive BEV world model that predicts future semantic maps conditioned on candidate actions. The driving policy is optimized via GRPO using a dual-reward mechanism: physical rewards derived from BEV rollouts to enforce hard safety constraints, and cognitive rewards from a language-aligned scorer for intent compliance. The work claims state-of-the-art results on NAVSIM v1 and v2 benchmarks together with safer driving and flexible user-defined language control.

Significance. If the empirical claims hold and the world-model assumption is substantiated, the framework would offer a practical way to combine high-level cognitive reasoning with low-level physical foresight inside an RL loop, addressing the behavioral-cloning ceiling of imitation learning. The distillation step that preserves VLM-derived cognition without runtime cost and the pluggable language interface are concrete engineering strengths that could improve controllability and interpretability in safety-critical driving systems.

major comments (1)
  1. [§3.2] §3.2 (auto-regressive BEV world model): The central safety claim—that physical rewards derived from BEV rollouts enforce hard safety constraints—depends on the world model producing future semantic maps whose derived metrics (collision, off-road, etc.) remain faithful over the multi-step horizons used in GRPO. No per-step IoU, rollout-consistency, or ground-truth simulator alignment numbers are reported, so it is unclear whether compounding prediction errors allow the policy to exploit model artifacts rather than true dynamics.
minor comments (2)
  1. [Abstract] Abstract: The acronym GRPO is used without expansion; define it as Group Relative Policy Optimization on first use.
  2. [§3] Notation: The distinction between the distilled BEV encoder and the separate auto-regressive world model should be clarified with a single diagram or explicit equation reference to avoid reader confusion about which component supplies the cognitive versus physical channel.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our work. We address the major comment point-by-point below and have incorporated revisions to strengthen the manuscript's claims regarding the BEV world model.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (auto-regressive BEV world model): The central safety claim—that physical rewards derived from BEV rollouts enforce hard safety constraints—depends on the world model producing future semantic maps whose derived metrics (collision, off-road, etc.) remain faithful over the multi-step horizons used in GRPO. No per-step IoU, rollout-consistency, or ground-truth simulator alignment numbers are reported, so it is unclear whether compounding prediction errors allow the policy to exploit model artifacts rather than true dynamics.

    Authors: We agree that explicit validation of the world model's predictive fidelity is important for substantiating the safety claims. While the original manuscript emphasized end-to-end NAVSIM results (which provide indirect evidence through policy performance), we acknowledge the value of direct metrics. In the revised manuscript, we have expanded §3.2 with a new evaluation subsection reporting per-step IoU for semantic map predictions, rollout consistency over the 5- and 10-step horizons used in GRPO, and alignment statistics against ground-truth simulator trajectories on a held-out validation set. These results indicate limited compounding error (IoU degradation <6% at 10 steps) and support that the policy optimizes against faithful dynamics. We have also added qualitative rollout visualizations and a brief discussion of remaining limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework assembles independent components

full rationale

The paper constructs CoPhy by distilling VLM knowledge into a BEV encoder (then discarding the VLM), training an auto-regressive BEV world model to predict future semantic maps conditioned on actions, and optimizing a policy via GRPO using dual rewards defined directly from those models' outputs. The physical reward derives safety metrics from world-model rollouts and the cognitive reward uses a language-aligned scorer; neither is fitted post-hoc to the NAVSIM benchmark metrics nor reduces any claimed prediction to its inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain. The SOTA claims and safer-driving results are presented as empirical outcomes of this assembly, making the overall derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate concrete free parameters or axioms; the central claims rest on unstated assumptions about distillation fidelity and world-model accuracy.

pith-pipeline@v0.9.0 · 5787 in / 1307 out tokens · 48531 ms · 2026-05-21T04:41:04.800943+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 12 internal anchors

  1. [1]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

  2. [2]

    Pseudo-simulation for autonomous driving

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving

  3. [3]

    End-to-end autonomous driving: Challenges and frontiers.TPAMI, 46(12):10164–10183, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.TPAMI, 46(12):10164–10183, 2024

  4. [4]

    Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.ICLR, 2026

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.ICLR, 2026

  5. [5]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 45(11):12878–12895, 2022

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 45(11):12878–12895, 2022

  6. [6]

    Openscene: The largest up-to-date 3d occupancy prediction bench- mark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

    OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction bench- mark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

  7. [7]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.NeurIPS, 37:28706–28719, 2024

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.NeurIPS, 37:28706–28719, 2024

  8. [8]

    Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving.RAL, 11(1):226–233, 2025

    Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yanjun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving.RAL, 11(1):226–233, 2025

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  11. [11]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, pages 17853–17862, 2023

  12. [12]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  13. [13]

    Irl-vla: Training an vision-language-action policy via reward world model.arXiv preprint arXiv:2508.06571, 2025

    Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. Irl-vla: Training an vision-language-action policy via reward world model.arXiv preprint arXiv:2508.06571, 2025

  14. [14]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

  15. [15]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InICCV, pages 8340–8350, 2023

  16. [16]

    AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

    Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025. 10

  17. [17]

    Adapt: Action-aware driving caption transformer

    Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, and Jingjing Liu. Adapt: Action-aware driving caption transformer. InICRA, pages 7554–7561. IEEE, 2023

  18. [18]

    Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv e-prints, pages arXiv–2503, 2025

    Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, et al. Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv e-prints, pages arXiv–2503, 2025

  19. [19]

    Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

    Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

  20. [20]

    Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

    Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

  21. [21]

    Enhancing end-to-end autonomous driving with latent world model.ICLR, 2025

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.ICLR, 2025

  22. [22]

    Drivevla-w0: World models amplify data scaling law in autonomous driving.ICLR, 2026

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.ICLR, 2026

  23. [23]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InICCV, pages 27137–27146, 2025

  24. [24]

    Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.ICLR, 2026

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.ICLR, 2026

  25. [25]

    Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving.arXiv preprint arXiv:2604.02190, 2026

    Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, et al. Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving.arXiv preprint arXiv:2604.02190, 2026

  26. [26]

    Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning

    Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, and Xinhai Zhao. Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning. InAAAI, volume 40, pages 6708–6716, 2026

  27. [27]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  28. [28]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, pages 12037–12047, 2025

  29. [29]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. InICCV, pages 2980–2988, 2017

  30. [30]

    X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025

    Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025

  31. [31]

    Sgdr: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017

  32. [32]

    Decoupled weight decay regularization.ICLR, 2018

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.ICLR, 2018

  33. [33]

    Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025

    Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025

  34. [34]

    Drama: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InWACV, pages 1043–1052, 2023. 11

  35. [35]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InECCV, pages 252–269. Springer, 2024

  36. [36]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  37. [37]

    Vlaad: Vision and language assistant for autonomous driving

    SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, and DongKyu Kim. Vlaad: Vision and language assistant for autonomous driving. In CVPR, pages 980–987, 2024

  38. [38]

    Multi-modal fusion transformer for end-to-end autonomous driving

    Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. InCVPR, pages 7077–7087, 2021

  39. [39]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InAAAI, volume 38, pages 4542–4550, 2024

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICLR, pages 8748–8763. PMLR, 2021

  41. [41]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InCVPR, pages 11993–12003, 2025

  42. [42]

    Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025

    Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025

  43. [43]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.NeurIPS, 2025

    Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.NeurIPS, 2025

  44. [44]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InCVPR, pages 15120–15130, 2024

  45. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  46. [46]

    ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, and Liu Ren. Explorevla: Dense world modeling and exploration for end-to-end autonomous driving.arXiv preprint arXiv:2604.02714, 2026

  47. [47]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InECCV, pages 256–274. Springer, 2024

  48. [48]

    Deep reinforcement learning for robotics: A survey of real-world successes.Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025

    Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes.Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025

  49. [49]

    Visualizing data using t-sne.JMLR, 9(11), 2008

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.JMLR, 9(11), 2008

  50. [50]

    Attention is all you need.NeurIPS, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 30, 2017

  51. [51]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  52. [52]

    Para-drive: Paral- lelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Paral- lelized architecture for real-time autonomous driving. InCVPR, pages 15449–15458, 2024. 12

  53. [53]

    Drivelaw: Unifying planning and video generation in a latent driving world.CVPR, 2026

    Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world.CVPR, 2026

  54. [54]

    Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

    Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InCVPR, pages 1602–1611, 2025

  55. [55]

    Wam-diff: A masked diffusion vla framework with moe and online reinforcement learning for autonomous driving.arXiv preprint arXiv:2512.11872, 2025

    Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu, Shan Luan, Yifang Xu, Neng Zhang, Yaoyi Li, Jia Cai, et al. Wam-diff: A masked diffusion vla framework with moe and online reinforcement learning for autonomous driving.arXiv preprint arXiv:2512.11872, 2025

  56. [56]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model.RAL, 9(10):8186–8193, 2024

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.RAL, 9(10):8186–8193, 2024

  57. [57]

    Ad-r1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models.arXiv preprint arXiv:2511.20325, 2025

    Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, et al. Ad-r1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models.arXiv preprint arXiv:2511.20325, 2025

  58. [58]

    Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

    Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

  59. [59]

    Drivesuprim: Towards precise trajectory selection for end-to-end planning

    Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InAAAI, volume 40, pages 11910–11918, 2026

  60. [60]

    Drama: An efficient end-to-end motion planner for autonomous driving with mamba.arXiv preprint arXiv:2408.03601, 2024

    Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba.arXiv preprint arXiv:2408.03601, 2024

  61. [61]

    Epona: Autoregressive diffusion world model for autonomous driving

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InICCV, pages 27220–27230, 2025

  62. [62]

    A survey of autonomous driving from a deep learning perspective.ACM Computing Surveys, 57(10):1–60, 2025

    Jingyuan Zhao, Yuyan Wu, Rui Deng, Susu Xu, Jinpeng Gao, and Andrew Burke. A survey of autonomous driving from a deep learning perspective.ACM Computing Surveys, 57(10):1–60, 2025

  63. [63]

    Genad: Genera- tive end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Genera- tive end-to-end autonomous driving. InECCV, pages 87–104. Springer, 2024

  64. [64]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InICCV, pages 28632–28642, 2025

  65. [65]

    Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.CVPR, 2026

    Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, and Lefei Zhang. Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.CVPR, 2026

  66. [66]

    Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. In AAAI, volume 40, pages 13782–13790, 2026

  67. [67]

    Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.NeurIPS, 2025

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.NeurIPS, 2025

  68. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13