
arxiv: 2605.21139 · v2 · pith:2F7C5IPDnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

Pith reviewed 2026-05-25 05:50 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords autonomous driving · reinforcement learning · BEV world model · vision-language models · cognitive-physical framework · NAVSIM benchmark · intent control · safety constraints

The pith

CoPhy distills VLM knowledge into a BEV encoder and pairs it with an auto-regressive BEV world model to optimize driving policies via dual-reward reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a Cognitive-Physical reinforcement learning framework called CoPhy that first distills vision-language model (VLM) knowledge into a bird's-eye-view (BEV) encoder to retain cognitive understanding of traffic semantics and driving intent at zero inference cost. It then builds an auto-regressive BEV world model that predicts future semantic maps conditioned on candidate actions, creating an interpretable physical sandbox for deriving safety metrics. These two components support GRPO (Group Relative Policy Optimization) with a physical reward that enforces hard safety constraints from the rollouts and a cognitive reward from a language-aligned scorer that ensures intent compliance. A sympathetic reader would care because the approach claims to surpass the behavioral cloning ceiling of imitation learning while enabling safer driving and flexible control through optional user language instructions.

Core claim

CoPhy achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks through three steps. It distills VLM knowledge into the BEV encoder to supply cognitive ability. It builds an auto-regressive BEV world model that foresees the consequences of candidate actions and serves as an interpretable physical sandbox. It then optimizes the policy with GRPO under a dual-reward mechanism: physical rewards from BEV rollouts enforce safety, and cognitive rewards from language alignment ensure intent compliance. Throughout, the cognitive channel stays open for optional human language commands.
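To make the dual-reward idea concrete, here is a minimal sketch of how a physical and a cognitive score could be combined and then turned into GRPO-style group-relative advantages. The reward formula is not given in the text above, so the gating rule, the weight `w_cog`, and the function names `dual_reward` and `group_relative_advantages` are all assumptions chosen for illustration. Only the group standardization step follows the general GRPO recipe.

```python
from statistics import mean, pstdev


def dual_reward(physical_ok, physical_score, cognitive_score, w_cog=0.5):
    """Hypothetical combination rule (an assumption, not the paper's formula).

    A failed hard-safety check zeroes the reward; otherwise the physical and
    cognitive scores are blended with weight w_cog.
    """
    if not physical_ok:
        return 0.0
    return (1.0 - w_cog) * physical_score + w_cog * cognitive_score


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate trajectories: (passes safety check, physical score, cognitive score)
candidates = [
    (True, 0.9, 0.8),
    (True, 0.6, 0.9),
    (False, 1.0, 1.0),  # unsafe rollout: gated to zero reward
    (True, 0.4, 0.2),
]
rewards = [dual_reward(*c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

In this toy group the unsafe candidate receives the lowest advantage even though its raw scores are the highest, which is the behavior a hard safety gate is meant to produce.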

What carries the argument

The auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as the interpretable physical sandbox from which safety metrics are directly derived, together with the distilled cognitive channel in the BEV encoder.
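The sandbox idea can be sketched with a toy stand-in for the learned model. In the example below, `toy_world_model` is a hypothetical hand-written function that shifts an ego-centric occupancy grid opposite to the ego displacement and assumes a static world; the real world model is learned and predicts semantic maps, which this sketch does not attempt. What it does show is the auto-regressive loop, where each step consumes the previous prediction, and a safety metric read directly off the predicted frames.

```python
def toy_world_model(bev, action):
    """Shift an ego-centric grid opposite to the ego displacement (static world)."""
    rows, cols = len(bev), len(bev[0])
    d_row, d_col = action
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if bev[r][c]:
                nr, nc = r - d_row, c - d_col
                if 0 <= nr < rows and 0 <= nc < cols:
                    nxt[nr][nc] = 1
    return nxt


def rollout(bev, actions, model):
    """Auto-regressive rollout: every step is fed the previous predicted frame."""
    frames = []
    for action in actions:
        bev = model(bev, action)
        frames.append(bev)
    return frames


def collides(frames, ego_cell):
    """Safety metric: does any predicted frame place an obstacle on the ego cell?"""
    r, c = ego_cell
    return any(frame[r][c] == 1 for frame in frames)


EGO = (4, 2)  # ego sits at the bottom-center cell; row 0 is farthest ahead
bev0 = [[0] * 5 for _ in range(5)]
bev0[2][2] = 1  # one obstacle two cells ahead of the ego

straight = [(-1, 0), (-1, 0), (-1, 0)]  # drive forward three steps
swerve = [(-1, 1), (-1, 0), (-1, 0)]    # step right once, then forward

straight_hits = collides(rollout(bev0, straight, toy_world_model), EGO)
swerve_hits = collides(rollout(bev0, swerve, toy_world_model), EGO)
```

Because the metric is computed on explicit grids rather than latent vectors, each predicted frame can be inspected directly, which is the sense in which the sandbox is described as interpretable.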

If this is right

  • Driving policies reach state-of-the-art performance on NAVSIM v1 and v2 benchmarks.
  • Physical rewards derived from BEV rollouts enforce hard safety constraints during optimization.
  • Cognitive rewards from the language-aligned scorer ensure compliance with driving intent.
  • The distilled cognitive channel supports flexible intent control through user-defined language instructions at no extra inference cost.
  • Cognitively informed scene compliance produces safer driving behavior overall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of distilled cognition and predictive physics could allow the same infrastructure to support other sequential decision tasks beyond driving.
  • Making the world model outputs directly inspectable may simplify safety audits and regulatory review of learned policies.
  • User language instructions could extend to multi-agent coordination scenarios if the cognitive channel is shared across vehicles.
  • The dual-reward structure might be tested for transfer to simulation environments with different sensor modalities.

Load-bearing premise

The auto-regressive BEV world model can accurately predict future semantic maps conditioned on candidate actions.

What would settle it

A direct comparison showing that the predicted future semantic maps from the BEV world model diverge significantly from actual observed maps in held-out driving sequences would falsify the reliability of the physical sandbox for safety metric derivation.
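One way such a comparison could be run is a per-step mean IoU between predicted and observed semantic maps, so that degradation over the rollout horizon is visible. The sketch below uses tiny hand-made label grids; the function names, the two-class setup, and the data are illustrative assumptions rather than the paper's evaluation protocol.

```python
def class_iou(pred, target, cls):
    """IoU for one class; returns None if the class is absent from both maps."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(p == cls and t == cls)
            union += int(p == cls or t == cls)
    return inter / union if union else None


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in either map."""
    scores = []
    for cls in range(num_classes):
        score = class_iou(pred, target, cls)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores)


def miou_per_step(pred_frames, true_frames, num_classes):
    """One mIoU value per rollout step, in rollout order."""
    return [mean_iou(p, t, num_classes) for p, t in zip(pred_frames, true_frames)]


# Two-step toy rollout with classes 0 (free) and 1 (occupied)
observed = [[[0, 1], [0, 0]], [[1, 1], [0, 0]]]
predicted = [[[0, 1], [0, 0]], [[0, 1], [0, 0]]]  # second step misses one cell

scores = miou_per_step(predicted, observed, num_classes=2)
```

A curve of these values over held-out sequences, plotted against horizon length, would show directly whether the sandbox stays reliable at the horizons used for safety scoring.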

Figures

Figures reproduced from arXiv: 2605.21139 by Jian Yang, Jin Xie, Qiang Meng, Yang Wu, Youquan Liu, Zhaojiang Liu.

Figure 1: (a) Previous methods isolate cognitive and physical reasoning, leading to semantic failures like ignoring a STOP sign or spatial violations like halting on a crosswalk. (b) In contrast, CoPhy respects traffic semantics and maintains lane discipline, ensuring safe and cognitive-aligned driving.
Figure 2: Overview of CoPhy. Multi-modal data are encoded into BEV state …
Figure 3: Hierarchical trajectory selection. Candidates passing the cognitive threshold …
Figure 4: Qualitative comparisons of trajectories before and after optimization. The dual-reward …
Figure 5: Qualitative comparisons with DiffusionDrive [28] and WoTE [23].
Figure 6: t-SNE [49] of distillation. [Adjacent panel text: original trajectory vs. human-intent trajectory for the human commands "I'm in a very urgent situation, accelerate and run the red light!" and "Stop following slowly, do not tailgate. Change lanes to an open lane" …]
Original abstract

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.
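The abstract does not state which objective is used for distillation, so the following is only a sketch of one common choice: aligning student (BEV encoder) features with frozen teacher (VLM) features through a cosine-similarity loss. The function names and the toy feature vectors are hypothetical, and the paper's actual loss may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_feats, teacher_feats):
    """Mean of (1 - cosine similarity) over a batch of feature pairs.

    The loss is 0 when every student vector points in the same direction as
    its teacher vector, regardless of magnitude.
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


teacher = [[1.0, 0.0], [0.0, 2.0]]        # stand-in for frozen VLM features
aligned_student = [[2.0, 0.0], [0.0, 1.0]]  # same directions, different scale
orthogonal_student = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = distillation_loss(aligned_student, teacher)
loss_orthogonal = distillation_loss(orthogonal_student, teacher)
```

Because the teacher only appears inside the training loss, it can be dropped once training ends, which is consistent with the zero-inference-cost claim in the abstract.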

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. It distills VLM knowledge into a BEV encoder (discarding the VLM at inference) to retain cognitive ability and expose a pluggable language interface, builds an auto-regressive BEV world model that predicts future semantic maps conditioned on candidate actions to serve as an interpretable physical sandbox, and optimizes the policy via GRPO using a dual-reward mechanism (physical reward from BEV rollouts for hard safety constraints plus cognitive reward from a language-aligned scorer for intent compliance). The central claims are SOTA performance on NAVSIM v1/v2 plus safer, language-controllable driving.

Significance. If the world-model accuracy and dual-reward claims hold with supporting evidence, the work could meaningfully advance beyond behavioral cloning in end-to-end driving by supplying both cognitive semantics and foresighted physical evaluation inside the RL loop, with the pluggable cognitive channel offering a practical route to user-specified intent. The explicit derivation of safety metrics from interpretable rollouts is a potentially valuable direction if quantitatively validated.

major comments (3)
  1. [Abstract] The claim that CoPhy 'achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks' is unsupported by any numerical metrics, baseline comparisons, ablation tables, or error analysis in the provided text, rendering the primary performance assertion unverifiable.
  2. [Abstract] The auto-regressive BEV world model is presented as producing sufficiently accurate future semantic maps to derive reliable safety metrics and enforce 'hard safety constraints,' yet the manuscript supplies no multi-step prediction metrics (mIoU, instance-level error, or closed-loop safety correlation) over the 4–8 s horizons relevant to safety evaluation; this assumption is load-bearing for the physical-reward component.
  3. [Abstract] The dual-reward mechanism (physical reward from BEV rollouts + cognitive reward from language-aligned scorer) and its integration with GRPO are described only at the level of high-level prose with no equations, reward formulations, or training details, preventing assessment of whether the rewards are genuinely new or risk circularity.
minor comments (2)
  1. [Abstract] Acronyms (VLM, BEV, GRPO, NAVSIM) are not defined on first use.
  2. [Abstract] The mapping between the rhetorical phrases 'distill to think' and 'foresee to act' and the concrete technical modules could be stated more explicitly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that several central claims require explicit supporting evidence within the abstract itself to be verifiable from the provided text. We will revise the abstract to incorporate key numerical results, prediction metrics, and reward formulations drawn from the full manuscript. Point-by-point responses follow.

point-by-point responses
  1. Referee: [Abstract] The claim that CoPhy 'achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks' is unsupported by any numerical metrics, baseline comparisons, ablation tables, or error analysis in the provided text, rendering the primary performance assertion unverifiable.

    Authors: We acknowledge that the abstract, as currently written, does not contain the supporting numerical evidence. The full manuscript reports these results in the experiments section, including direct comparisons and ablations on NAVSIM v1/v2. To address the concern directly, we will revise the abstract to include the primary SOTA metrics and improvement margins over baselines.
    Revision: yes

  2. Referee: [Abstract] The auto-regressive BEV world model is presented as producing sufficiently accurate future semantic maps to derive reliable safety metrics and enforce 'hard safety constraints,' yet the manuscript supplies no multi-step prediction metrics (mIoU, instance-level error, or closed-loop safety correlation) over the 4–8 s horizons relevant to safety evaluation; this assumption is load-bearing for the physical-reward component.

    Authors: This observation is correct for the abstract text. The manuscript contains the requested multi-step metrics and safety correlations in the world-model evaluation subsection. We will add a concise statement of these metrics (mIoU and horizon-specific accuracy) to the revised abstract to substantiate the physical-reward claims.
    Revision: yes

  3. Referee: [Abstract] The dual-reward mechanism (physical reward from BEV rollouts + cognitive reward from language-aligned scorer) and its integration with GRPO are described only at the level of high-level prose with no equations, reward formulations, or training details, preventing assessment of whether the rewards are genuinely new or risk circularity.

    Authors: We agree that the abstract provides only a prose description. The methods section supplies the full reward equations, GRPO integration, and training details. We will incorporate the core reward formulations into the revised abstract to allow assessment of novelty and avoid any appearance of circularity.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmark results without self-referential reductions

full rationale

The provided abstract and text describe a framework with a distilled BEV encoder, auto-regressive world model for semantic map prediction, and dual-reward GRPO optimization. No equations, parameter-fitting procedures, or derivation steps are shown that reduce a claimed prediction or result to its own inputs by construction. Performance is asserted via SOTA on NAVSIM v1/v2 benchmarks, which are external. The world model and rewards are presented as infrastructure components without evidence of self-definition or fitted-input renaming. This is the common case of a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework rests on the unverified effectiveness of VLM distillation for retaining cognitive ability and on the predictive fidelity of the new BEV world model; both lack independent evidence in the abstract.

axioms (1)
  • [domain assumption] Vision-language models contain transferable cognitive knowledge about traffic semantics and driving intent that can be distilled into a BEV encoder without loss of utility.
    Invoked to justify discarding the VLM after distillation while retaining cognitive ability.
invented entities (2)
  • Auto-regressive BEV world model (no independent evidence)
    Purpose: Predict future semantic maps conditioned on candidate actions to derive safety metrics.
    New component introduced to serve as physical sandbox; no external validation mentioned.
  • Cognitive channel as pluggable interface (no independent evidence)
    Purpose: Enable optional human language commands after distillation.
    Introduced as byproduct of the distillation process.



Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 12 internal anchors

  1. [1]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  2. [2]

    Pseudo-simulation for autonomous driving

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving

  3. [3]

    End-to-end autonomous driving: Challenges and frontiers

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. TPAMI, 46(12):10164–10183, 2024

  4. [4]

    Vadv2: End-to-end vectorized autonomous driving via probabilistic planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. ICLR, 2026

  5. [5]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. TPAMI, 45(11):12878–12895, 2022

  6. [6]

    Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving

    OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023

  7. [7]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. NeurIPS, 37:28706–28719, 2024

  8. [8]

    Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving

    Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yanjun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. RAL, 11(1):226–233, 2025

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  11. [11]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023

  12. [12]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  13. [13]

    Irl-vla: Training an vision-language-action policy via reward world model

    Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, et al. Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571, 2025

  14. [14]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

  15. [15]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, pages 8340–8350, 2023

  16. [16]

    AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

    Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608, 2025

  17. [17]

    Adapt: Action-aware driving caption transformer

    Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, and Jingjing Liu. Adapt: Action-aware driving caption transformer. In ICRA, pages 7554–7561. IEEE, 2023

  18. [18]

    Finetuning generative trajectory model with reinforcement learning from human feedback

    Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, et al. Finetuning generative trajectory model with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2503, 2025

  19. [19]

    Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation

    Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820, 2025

  20. [20]

    SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

    Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving. arXiv preprint arXiv:2512.10719, 2, 2025

  21. [21]

    Enhancing end-to-end autonomous driving with latent world model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. ICLR, 2025

  22. [22]

    Drivevla-w0: World models amplify data scaling law in autonomous driving

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. ICLR, 2026

  23. [23]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. In ICCV, pages 27137–27146, 2025

  24. [24]

    Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. ICLR, 2026

  25. [25]

    Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving

    Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, et al. Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving. arXiv preprint arXiv:2604.02190, 2026

  26. [26]

    Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning

    Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, and Xinhai Zhao. Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning. In AAAI, volume 40, pages 6708–6716, 2026

  27. [27]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024

  28. [28]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In CVPR, pages 12037–12047, 2025

  29. [29]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017

  30. [30]

    X-driver: Explainable autonomous driving with vision-language models

    Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models. arXiv preprint arXiv:2505.05098, 2025

  31. [31]

    Sgdr: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017

  32. [32]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2018

  33. [33]

    Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving

    Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025

  34. [34]

    Drama: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. In WACV, pages 1043–1052, 2023

  35. [35]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. In ECCV, pages 252–269. Springer, 2024

  36. [36]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  37. [37]

    Vlaad: Vision and language assistant for autonomous driving

    SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, and DongKyu Kim. Vlaad: Vision and language assistant for autonomous driving. In CVPR, pages 980–987, 2024

  38. [38]

    Multi-modal fusion transformer for end-to-end autonomous driving

    Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In CVPR, pages 7077–7087, 2021

  39. [39]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In AAAI, volume 38, pages 4542–4550, 2024

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  41. [41]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In CVPR, pages 11993–12003, 2025

  42. [42]

    Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving

    Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234, 2025

  43. [43]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving

    Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. NeurIPS, 2025

  44. [44]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In CVPR, pages 15120–15130, 2024

  45. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  46. [46]

    ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, and Liu Ren. Explorevla: Dense world modeling and exploration for end-to-end autonomous driving. arXiv preprint arXiv:2604.02714, 2026

  47. [47]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In ECCV, pages 256–274. Springer, 2024

  48. [48]

    Deep reinforcement learning for robotics: A survey of real-world successes

    Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025

  49. [49]

    Visualizing data using t-sne

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(11), 2008

  50. [50]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017

  51. [51]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  52. [52]

    Para-drive: Paral- lelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Paral- lelized architecture for real-time autonomous driving. InCVPR, pages 15449–15458, 2024. 12

  53. [53]

    Drivelaw: Unifying planning and video generation in a latent driving world.CVPR, 2026

    Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world.CVPR, 2026

  54. [54]

    Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

    Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InCVPR, pages 1602–1611, 2025

  55. [55]

    Wam-diff: A masked diffusion vla framework with moe and online reinforcement learning for autonomous driving. arXiv preprint arXiv:2512.11872, 2025

    Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu, Shan Luan, Yifang Xu, Neng Zhang, Yaoyi Li, Jia Cai, et al. Wam-diff: A masked diffusion vla framework with moe and online reinforcement learning for autonomous driving. arXiv preprint arXiv:2512.11872, 2025

  56. [56]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model. RAL, 9(10):8186–8193, 2024

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. RAL, 9(10):8186–8193, 2024

  57. [57]

    Ad-r1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models. arXiv preprint arXiv:2511.20325, 2025

    Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, et al. Ad-r1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models. arXiv preprint arXiv:2511.20325, 2025

  58. [58]

    Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). arXiv preprint arXiv:2505.16394, 2025

    Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). arXiv preprint arXiv:2505.16394, 2025

  59. [59]

    Drivesuprim: Towards precise trajectory selection for end-to-end planning

    Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. In AAAI, volume 40, pages 11910–11918, 2026

  60. [60]

    Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024

    Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024

  61. [61]

    Epona: Autoregressive diffusion world model for autonomous driving

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. In ICCV, pages 27220–27230, 2025

  62. [62]

    A survey of autonomous driving from a deep learning perspective. ACM Computing Surveys, 57(10):1–60, 2025

    Jingyuan Zhao, Yuyan Wu, Rui Deng, Susu Xu, Jinpeng Gao, and Andrew Burke. A survey of autonomous driving from a deep learning perspective. ACM Computing Surveys, 57(10):1–60, 2025

  63. [63]

    Genad: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In ECCV, pages 87–104. Springer, 2024

  64. [64]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. In ICCV, pages 28632–28642, 2025

  65. [65]

    Resad: Normalized residual trajectory modeling for end-to-end autonomous driving. CVPR, 2026

    Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, and Lefei Zhang. Resad: Normalized residual trajectory modeling for end-to-end autonomous driving. CVPR, 2026

  66. [66]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. In AAAI, volume 40, pages 13782–13790, 2026

  67. [67]

    Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. NeurIPS, 2025

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. NeurIPS, 2025

  68. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025