
arxiv: 2605.14201 · v2 · pith:HNFYBUPPnew · submitted 2026-05-13 · 💻 cs.RO · cs.CV

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords end-to-end autonomous driving · vision-language-action models · multi-agent rollouts · closed-loop training · latent space simulation · reinforcement learning · Bench2Drive

The pith

MAPLE trains end-to-end driving models through reactive multi-agent rollouts performed inside the model's own latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAPLE to fix the brittleness of vision-language-action models for autonomous driving, which arises when they are trained only with imitation learning. It does this by running closed-loop multi-agent simulations directly in the model's latent space, letting the ego vehicle and nearby traffic agents each take independent multi-step actions while reacting to one another. Training happens in two stages: supervised fine-tuning on rollouts derived from ground-truth trajectories, then reinforcement learning that adds rewards for safety, progress, interaction realism, and behavioral diversity. A sympathetic reader would care because the method promises more robust driving policies without depending on slow, low-fidelity external simulators.

Core claim

MAPLE is a framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons while remaining reactive to other agents, enabling closed-loop training. The approach uses two stages of training: supervised fine-tuning on latent rollouts based on ground-truth trajectories, followed by reinforcement learning with global and agent-specific rewards plus diversity rewards. MAPLE reaches state-of-the-art driving performance on Bench2Drive and shows that scalable closed-loop multi-agent play can produce robust end-to-end autonomous driving systems.

What carries the argument

Latent multi-agent rollout mechanism that independently advances the ego vehicle and traffic agents over multiple steps while modeling their mutual reactivity inside the VLA model's latent space.
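
As an editorial illustration of the rollout structure described here, the following is a minimal sketch and not the paper's implementation. It assumes each agent is summarized by a small latent vector, and the names `toy_policy`, `toy_transition`, and `latent_rollout` are hypothetical. Every agent picks its action from its own token plus a summary of the other agents' tokens, and the updated tokens are fed back for the next step.

```python
from typing import Callable, List

Vector = List[float]
Policy = Callable[[Vector, Vector], Vector]


def mean_of_others(tokens: List[Vector], i: int) -> Vector:
    """Average the latent tokens of every agent except agent i."""
    others = [tok for j, tok in enumerate(tokens) if j != i]
    dim = len(tokens[i])
    if not others:
        return [0.0] * dim
    return [sum(tok[d] for tok in others) / len(others) for d in range(dim)]


def toy_policy(own: Vector, context: Vector) -> Vector:
    """Hypothetical planner: step away from the other agents' mean token."""
    return [0.1 * (o - c) for o, c in zip(own, context)]


def toy_transition(own: Vector, action: Vector) -> Vector:
    """Hypothetical latent dynamics: a plain additive update."""
    return [o + a for o, a in zip(own, action)]


def latent_rollout(
    tokens: List[Vector], policies: List[Policy], horizon: int
) -> List[List[Vector]]:
    """Unroll all agents for `horizon` steps, feeding tokens back each step."""
    history = [tokens]
    for _ in range(horizon):
        # Every agent acts on the same snapshot: each is controlled by its
        # own policy but conditions on the latest state of the others.
        actions = [
            policy(tokens[i], mean_of_others(tokens, i))
            for i, policy in enumerate(policies)
        ]
        tokens = [toy_transition(tok, act) for tok, act in zip(tokens, actions)]
        history.append(tokens)
    return history


# Two agents with one-dimensional tokens, unrolled for three steps.
history = latent_rollout([[0.0], [1.0]], [toy_policy, toy_policy], horizon=3)
```

In this toy setup the two agents drift apart because each policy reacts to the other's token. In MAPLE the corresponding roles would be played by the VLA backbone and its ego and reactive-agent planners, whose details are not reproduced here.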

If this is right

  • Closed-loop training of driving policies becomes feasible without running external simulators.
  • Models learn to handle reactive traffic interactions more robustly than imitation learning alone allows.
  • Diversity rewards let the planner produce behaviors absent from the original logged data.
  • Global and agent-specific rewards jointly encourage safety, forward progress, and realistic multi-agent dynamics.
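
To make the reward bullets concrete, here is a hedged sketch of how a safety term, per-agent terms, and a diversity term could be combined. The exact reward definitions in the paper are not reproduced on this page, so the shaping function, the 2-second threshold, the weights, and all function names below are assumptions chosen for illustration. The time-to-collision (TTC) formula is the standard constant-velocity one.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Constant-velocity TTC: gap over closing speed, infinite if not closing."""
    if closing_speed_mps <= 0.0:
        return math.inf
    return gap_m / closing_speed_mps


def safety_reward(ttc_s: float, threshold_s: float = 2.0) -> float:
    """Assumed shaping: 0 above the threshold, falling linearly to -1 at TTC 0."""
    if ttc_s >= threshold_s:
        return 0.0
    return ttc_s / threshold_s - 1.0


def diversity_reward(trajectories: List[Trajectory]) -> float:
    """Mean pairwise point distance between equal-length candidate plans."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
    return total / len(pairs)


def total_reward(
    global_r: float,
    agent_rs: Sequence[float],
    diversity_r: float,
    w_global: float = 1.0,  # hypothetical weights, not taken from the paper
    w_agent: float = 1.0,
    w_div: float = 0.1,
) -> float:
    """Weighted sum of global, per-agent, and diversity terms."""
    return w_global * global_r + w_agent * sum(agent_rs) + w_div * diversity_r


# Example: a 5 m gap closing at 5 m/s, two agents making progress, and two
# candidate plans that stay one metre apart.
ttc = time_to_collision(gap_m=5.0, closing_speed_mps=5.0)
plans = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
reward = total_reward(safety_reward(ttc), [1.0, 0.5], diversity_reward(plans))
```

A small diversity weight is used here so that the novelty term cannot outweigh the safety penalty. Whether MAPLE balances the terms this way is not stated in the material above.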

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-play structure could be tested on other multi-agent embodied tasks such as manipulation in crowded scenes.
  • Staying inside the model's latent space might reduce the distribution shift that usually appears when policies move from training to real-world sensors.
  • Diversity rewards could be combined with real-world data collection loops to continually expand the set of encountered driving scenarios.

Load-bearing premise

The VLA model's latent space can faithfully represent independent multi-step controls for the ego vehicle and traffic agents while capturing their reactive interactions.
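
One way to probe this premise is a per-step error profile between positions decoded from latent rollouts and held-out reference trajectories. The sketch below is an editorial illustration under two assumptions: that rollout tokens can be decoded to 2D positions, and that matched reference trajectories are available. The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def per_step_displacement_error(
    predicted: Sequence[Sequence[Point]],
    reference: Sequence[Sequence[Point]],
) -> List[float]:
    """Mean Euclidean error at each rollout step.

    predicted[k][t] and reference[k][t] are the positions of rollout k at
    step t. All rollouts are assumed to share the same horizon.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need the same non-zero number of rollouts in both sets")
    horizon = len(predicted[0])
    errors = []
    for t in range(horizon):
        step_errors = [math.dist(p[t], r[t]) for p, r in zip(predicted, reference)]
        errors.append(sum(step_errors) / len(step_errors))
    return errors


# One toy rollout that drifts sideways by one metre per step.
pred = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
ref = [[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]]
profile = per_step_displacement_error(pred, ref)
```

An error profile that grows quickly with the step index would suggest that long-horizon latent rollouts drift away from the dynamics they are meant to stand in for.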

What would settle it

If closed-loop evaluations on Bench2Drive or similar benchmarks show that MAPLE-trained models produce no measurable gains in safety, progress, or collision avoidance compared with standard imitation-learning baselines, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.14201 by Deepti Hegde, Fatih Porikli, Hanno Ackermann, Hong Cai, Hsin-Pai Cheng, Litian Liu, Meysam Sadeghigooghari, Mohammad Ghavamzadeh, Pranav Desai, Rajeev Yasarla, Shizhong Han, Yunxiao Shi.

Figure 1: MAPLE pretraining and future state prediction. Left: Pretraining the VLA backbone with auxiliary supervision (e.g., map learning, detection, and motion prediction). Right: State-transition pretraining that predicts next-step ego/agent states over a horizon T to stabilize the token space.
Multi-agent simulation and self-play. Trajectory forecasting methods [12, 35, 49] model joint agent futures from fixed o…
Figure 2: MAPLE supervised fine-tuning (SFT) stage. Left: Single-step supervision and inference. The VLA backbone encodes multi-view images (and map features) into ego and agent tokens, which are decoded by an ego planner, reactive-agent planners, and a motion head. Right: The same model unrolled for T steps during imitation-learning-based scenario rollouts. Predicted tokens/trajectories are fed back autoregressivel…
Figure 3: MAPLE RL fine-tuning stage. Starting from the SFT model, we optimize multi-step rollouts over T steps using RL with safety-aware and interaction-aware rewards (e.g., collision avoidance and TTC).
… progress and safe driving for each controlled agent, and (iii) a diversity reward that promotes distinct behaviors across different planners/policies. At a time t, we define the total rollout reward as Rt = Gt + D…
Figure 4: Qualitative examples of closed-loop driving on Bench2Drive using MAPLE. We show representative trajectories in diverse scenarios, including adverse-weather scenes with limited visibility and sudden pedestrian crossings (top row), and clear suburban traffic with dynamic agents such as cyclists and surrounding vehicles (bottom row). Blue curves denote the planned ego-vehicle trajectory, highlighting smooth …
Figure 5: BEV qualitative comparison on Bench2Drive (closed-loop). Bird’s-eye-view visualization for the same route/scenario (RouteScenario_25951_rep0, HazardAtSideLaneTwoWays_1, weather_id=7). Left: ReCogDrive [27]. Right: MAPLE (ours). The planned ego trajectory is overlaid, illustrating different interaction outcomes in the same context.
Figure 6: Closed-loop rollout comparison on Bench2Drive. Multi-frame qualitative rollouts for the same route/scenario (RouteScenario_25951_rep0, HazardAtSideLaneTwoWays_1, weather_id=7). Top row: ReCogDrive. Bottom row: MAPLE. Colored curves denote the planned ego trajectory across time, highlighting differences in closed-loop interaction behavior.
… containing dynamic agents (e.g., two cyclists traveling along the r…
Figure 7: Additional qualitative closed-loop driving examples on Bench2Drive using MAPLE. These examples include challenging conditions such as low-light/night driving with sudden pedestrian appearances and wet-road reflections, dense fog/highway driving with reduced visibility, and urban scenes with adverse weather. Blue/cyan curves denote the planned ego trajectory.
Figure 8: Additional qualitative closed-loop driving examples on Bench2Drive using MAPLE. More examples covering suburban/rural traffic with oncoming vehicles and lane curvature, as well as nighttime intersection scenarios with wet-road conditions and surrounding traffic. Blue/cyan curves denote the planned ego trajectory.
… gradual curvature and without abrupt corrections
Figure 9: Failure case: over-cautious avoidance leading to lane departure (Bench2Drive, closed-loop). In this scenario, MAPLE performs an overly conservative unprotected left turn to avoid a potential collision, resulting in a brief deviation outside the route lanes for about 1.0 meters (1.29% of the full route). The car quickly moves back to the lane after this brief deviation. Blue/cyan curves denote the planned e…
Original abstract

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent -specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real-world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces MAPLE, a two-stage framework for closed-loop training of vision-language-action (VLA) models in end-to-end autonomous driving. Stage 1 performs supervised fine-tuning on multi-step latent rollouts derived from ground-truth trajectories, with the ego vehicle and nearby traffic agents controlled independently yet reactively. Stage 2 applies reinforcement learning using a combination of global, agent-specific, and diversity rewards to promote safety, progress, interaction realism, and behaviors absent from logged data. The approach operates entirely in the VLA latent space without external simulators and reports state-of-the-art results on the Bench2Drive benchmark.

Significance. If the empirical claims hold, the work offers a scalable route to robust closed-loop E2E driving policies by explicitly modeling reactive multi-agent dynamics inside a learned latent space. The two-stage pipeline and the addition of diversity rewards to escape imitation-learning mode collapse constitute a practical contribution that could reduce dependence on expensive, low-fidelity simulators while improving generalization.

major comments (2)
  1. [§3.2] §3.2 (Latent Rollout and RL Stage): The central claim that independent multi-step control of ego and traffic agents inside the VLA latent space yields realistic closed-loop interactions rests on the untested assumption that the latent dynamics encode sufficient causal structure. Without a quantitative fidelity check—such as multi-step prediction error or collision-rate agreement between latent rollouts and held-out simulator trajectories—the RL stage (global + agent-specific + diversity rewards) may optimize against an inaccurate internal world model, rendering the Bench2Drive SOTA result potentially artifactual.
  2. [Table 2] Table 2 (Bench2Drive closed-loop results): The reported SOTA margins are presented without seed-wise variance, statistical significance tests, or an ablation that isolates the contribution of the diversity reward term. If the diversity component yields only marginal gains (as suggested by the modest effect sizes in the reward-ablation rows), the emphasis on generating novel planning behaviors not present in logged data is weakened.
minor comments (3)
  1. The abstract contains a minor typographical inconsistency ('agent -specific' with extraneous space).
  2. [Figure 3] Figure 3 (example latent rollouts) would benefit from clearer annotation of reactive events (e.g., arrows indicating agent responses) to help readers verify the claimed interaction realism.
  3. [§2] Related-work section §2 omits several recent VLA driving papers that also explore latent-space planning; adding them would better situate the novelty of the multi-agent rollout.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing MAPLE. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Latent Rollout and RL Stage): The central claim that independent multi-step control of ego and traffic agents inside the VLA latent space yields realistic closed-loop interactions rests on the untested assumption that the latent dynamics encode sufficient causal structure. Without a quantitative fidelity check—such as multi-step prediction error or collision-rate agreement between latent rollouts and held-out simulator trajectories—the RL stage (global + agent-specific + diversity rewards) may optimize against an inaccurate internal world model, rendering the Bench2Drive SOTA result potentially artifactual.

    Authors: We agree that an explicit quantitative fidelity analysis would provide stronger support for the assumption that the VLA latent space encodes sufficient causal structure for multi-agent interactions. While the closed-loop SOTA results on Bench2Drive (which uses a high-fidelity simulator for evaluation) offer indirect validation that the learned dynamics support effective policy optimization, we acknowledge this does not fully substitute for direct multi-step prediction metrics. In the revision we will add a new subsection with multi-step rollout error analysis and collision-rate comparisons against held-out trajectories to quantify latent dynamics fidelity. revision: yes

  2. Referee: [Table 2] Table 2 (Bench2Drive closed-loop results): The reported SOTA margins are presented without seed-wise variance, statistical significance tests, or an ablation that isolates the contribution of the diversity reward term. If the diversity component yields only marginal gains (as suggested by the modest effect sizes in the reward-ablation rows), the emphasis on generating novel planning behaviors not present in logged data is weakened.

    Authors: We appreciate this point on statistical rigor. The current manuscript includes reward ablations but does not report per-seed variance or formal significance testing. We will revise Table 2 to include standard deviations across multiple random seeds and add p-value comparisons for key metrics. For the diversity reward, while the ablation rows show its contribution to escaping mode collapse, we agree that the modest effect sizes merit closer examination; the revision will expand the ablation table with additional metrics (e.g., behavior novelty scores) and clarify how diversity interacts with the other reward terms to produce behaviors absent from the training distribution. revision: yes
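
The seed-wise reporting discussed in this exchange can be sketched with the standard library alone. This is an illustrative example rather than the authors' analysis: the function names are hypothetical and the scores are placeholder values, not results from the paper. It reports the mean and sample standard deviation per method and computes Welch's t statistic with the Welch-Satterthwaite degrees of freedom.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    # Squared standard errors of the two sample means.
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    if se_a + se_b == 0.0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder per-seed scores for two methods (illustrative only).
baseline = [1.0, 2.0, 3.0]
candidate = [2.0, 3.0, 4.0]
mean_c, std_c = summarize(candidate)
t_stat, dof = welch_t(baseline, candidate)
```

Turning the statistic into a p-value additionally requires the Student t distribution, which the standard library does not provide. A statistics package would be needed for that step.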

Circularity Check

0 steps flagged

No circularity in MAPLE derivation chain

Full rationale

The paper presents MAPLE as a two-stage training procedure (supervised fine-tuning on latent rollouts from ground-truth trajectories, followed by RL using global, agent-specific, and diversity rewards) for closed-loop multi-agent control inside a VLA latent space. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the empirical SOTA result on Bench2Drive and the architectural description of independent multi-step control with reactivity; these do not reduce to the inputs by construction and remain externally falsifiable via closed-loop evaluation on Bench2Drive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that latent-space rollouts can substitute for external simulators in modeling reactive multi-agent driving scenes; no numerical free parameters are specified in the abstract.

axioms (1)
  • domain assumption: The VLA model's latent space supports accurate multi-step reactive rollouts of ego and traffic agents.
    Invoked to enable closed-loop training without external simulators as stated in the abstract.
invented entities (1)
  • Diversity rewards (no independent evidence)
    purpose: Encourage generation of planning behaviors absent from logged driving data.
    New reward component introduced in the RL stage to promote exploration beyond training distribution.




Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 10 internal anchors

  [1] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
  [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. ar...
  [4] Luca Bergamini, Yawei Ye, Oliver Scheel, Long Chen, Chih-Yuan Hu, Luca Delévaux, Niels Muller, and Peter Ondruska. SimNet: Learning reactive self-driving simulations from real-world observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  [5] Marco Cusumano-Towner, David Hafner, Alexander Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Kraehenbuehl, and Vladlen Koltun. Robust autonomy emerges from self-play. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.
  [6] Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning, pages 1268–1281. PMLR, 2023.
  [7] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning, pages 1–16. PMLR, 2017.
  [8] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
  [9] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755, 2025.
  [10] Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, John D. Co-Reyes, Rishabh Agarwal, Rebecca Roelofs, Yao Lu, Nico Montali, Paul Mougin, Zoey Yang, Brandyn White, Aleksandra Faust, Rowan McAllister, Dragomir Anguelov, and Benjamin Sapp. Waymax: An accelerated, data-driven simulator f...
  [11] H. Caesar, J. Kabzan, K. Tan, et al. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. In CVPR ADP3 Workshop, 2021.
  [12] Dirk Helbing and Péter Molnár. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282–4286, 1995.
  [13] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  [14] Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-Drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451. IEEE, 2025.
  [15] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, and Mingxing Tan. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
  [16] Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, and Andreas Geiger. CaRL: Learning scalable planning policies with simple rewards. arXiv preprint arXiv:2504.17838, 2025.
  [17] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV, 2023.
  [18] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023.
  [19] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.
  [20] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. In The Thirteenth International Conference on Learning Representations, 2025.
  [21] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
  [22] Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024.
  [23] Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. AlphaDrive: Unleashing the power of VLMs in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608, 2025.
  [24] Peter Karkus, Maximilian Igl, Yuxiao Chen, Kashyap Chitta, Boris Ivanovic, and Marco Pavone. Beyond behavior cloning in autonomous driving: A survey of closed-loop training techniques. Technical report, NVIDIA Research, 2025.
  [25] Roberta Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of generalisation in deep reinforcement learning. arXiv preprint arXiv:2111.09794, 2023.
  [26] Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, et al. Finetuning generative trajectory model with reinforcement learning from human feedback. arXiv preprint arXiv:2503.10434, 2025.
  [27] Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. In International Conference on Learning Representations (ICLR), 2026.
  [28] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025.
  [29] Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving. arXiv preprint arXiv:2506.09800, 2025.
  [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  [31] Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023.
  [32] Davis Rempe, Jonah Philion, Leonidas J Guibas, Sanja Fidler, and Or Litany. Generating useful accident-prone driving scenarios via a learned traffic prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  [33] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025.
  [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  [35] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. arXiv preprint arXiv:2209.13508, 2022.
  [36] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
  [37] Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22432–22441, 2025.
  [38] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10400–10409, 2021.
  [39] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25605–25615, 2025.
  [40] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
  [41] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019.
  [42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  [43] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 35:6119–6132, 2022.
  [44] Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. GoalFlow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025.
  [45] Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, et al. Generative scenario rollouts for end-to-end autonomous driving. arXiv preprint arXiv:2601.11475, 2026.
  [46] Liuhan Yin, Runkun Ju, Guodong Guo, and Erkang Cheng. DiffRefiner: Coarse to fine trajectory planning via diffusion refinement with semantic interaction for end to end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12009–12017, 2026.
  [47] Linrui Zhang, Zhenghao Peng, Quanyi Li, and Bolei Zhou. CAT: Closed-loop adversarial training for safe end-to-end driving. In Conference on Robot Learning, 2023.
  [48] Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. arXiv preprint arXiv:2501.15564, 2025.
  [49] Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17863–17873, 2023.
  [50] Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025.

A Ablation Study
A.1 Number of Reactive Agents
Agent Distribution in Bench2Drive. To contextualize...