MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Deepti Hegde; Fatih Porikli; Hanno Ackermann; Hong Cai; Hsin-Pai Cheng; Litian Liu; Meysam Sadeghigooghari; Mohammad Ghavamzadeh; Pranav Desai; Rajeev Yasarla

arxiv: 2605.14201 · v2 · pith:HNFYBUPPnew · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords end-to-end autonomous driving · vision-language-action models · multi-agent rollouts · closed-loop training · latent space simulation · reinforcement learning · Bench2Drive

The pith

MAPLE trains end-to-end driving models through reactive multi-agent rollouts performed inside the model's own latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAPLE to fix the brittleness of vision-language-action models for autonomous driving, which arises when they are trained only with imitation learning. It does this by running closed-loop multi-agent simulations directly in the model's latent space, letting the ego vehicle and nearby traffic agents each take independent multi-step actions while reacting to one another. The training happens in two stages: supervised fine-tuning on rollouts derived from ground-truth trajectories, then reinforcement learning that adds rewards for safety, progress, interaction realism, and behavioral diversity. A sympathetic reader would care because the method promises more robust driving policies without depending on slow, low-fidelity external simulators.

Core claim

MAPLE is a framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons while remaining reactive to other agents, enabling closed-loop training. The approach uses two stages of training: supervised fine-tuning on latent rollouts based on ground-truth trajectories, followed by reinforcement learning with global and agent-specific rewards plus diversity rewards. MAPLE reaches state-of-the-art driving performance on Bench2Drive and shows that scalable closed-loop multi-agent play can produce robust end-to-end autonomous driving systems.

What carries the argument

Latent multi-agent rollout mechanism that independently advances the ego vehicle and traffic agents over multiple steps while modeling their mutual reactivity inside the VLA model's latent space.
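To make the rollout idea concrete, here is a minimal sketch of what such a loop could look like. Everything in it is an illustrative assumption rather than the paper's implementation: each agent's latent token is a short list of floats, and `toy_policy` and `toy_transition` are hypothetical stand-ins for the planner heads and the state-transition model. The point is only the control flow: every agent picks its own action from a shared snapshot of the scene, then all states advance together, so each agent's next step depends on what the others just did.

```python
from typing import Dict, List

Latent = List[float]  # toy stand-in for one agent's latent token


def toy_policy(own: Latent, others: List[Latent]) -> Latent:
    """Hypothetical planner: drift toward the origin while being pushed
    away from the mean latent of the other agents."""
    if others:
        mean_other = [sum(o[i] for o in others) / len(others) for i in range(len(own))]
    else:
        mean_other = list(own)  # no neighbours: the repulsion term vanishes
    return [-0.1 * x + 0.05 * (x - m) for x, m in zip(own, mean_other)]


def toy_transition(state: Latent, action: Latent) -> Latent:
    """Hypothetical latent dynamics: a simple additive update."""
    return [s + a for s, a in zip(state, action)]


def latent_rollout(init: Dict[str, Latent], horizon: int) -> List[Dict[str, Latent]]:
    """Unroll every agent for `horizon` steps.

    At each step all agents read the same snapshot of the scene, choose
    their own action independently, and only then are all states advanced.
    """
    states = {name: list(tok) for name, tok in init.items()}
    history = [states]
    for _ in range(horizon):
        actions = {
            name: toy_policy(tok, [o for n, o in states.items() if n != name])
            for name, tok in states.items()
        }
        states = {name: toy_transition(states[name], actions[name]) for name in states}
        history.append(states)
    return history


# Usage: one ego token and one traffic-agent token, unrolled for 8 steps.
trajectory = latent_rollout({"ego": [1.0, 0.0], "agent_0": [0.0, 1.0]}, horizon=8)
```

Changing the initial latent of `agent_0` changes where `ego` ends up, which is the reactivity property in miniature. In a real system the two toy functions would be learned networks.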

If this is right

Closed-loop training of driving policies becomes feasible without running external simulators.
Models learn to handle reactive traffic interactions more robustly than imitation learning alone allows.
Diversity rewards let the planner produce behaviors absent from the original logged data.
Global and agent-specific rewards jointly encourage safety, forward progress, and realistic multi-agent dynamics.
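One way to picture how a scene-level term, per-agent terms, and a diversity term might combine is sketched below. The additive form, the weights, the collision penalty, and the 2-second time-to-collision threshold are all assumptions made for illustration. They are not taken from the paper, whose exact reward definition is not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StepSignals:
    """Hypothetical per-step measurements from one rollout step."""
    collided: bool                 # any collision in the scene at this step
    progress_m: Dict[str, float]   # metres advanced along the route, per agent
    min_ttc_s: Dict[str, float]    # smallest time-to-collision in seconds, per agent
    diversity: float               # precomputed diversity score for this step


def global_reward(sig: StepSignals, collision_penalty: float = 10.0) -> float:
    """Scene-level term: penalise any collision (penalty value is assumed)."""
    return -collision_penalty if sig.collided else 0.0


def agent_reward(progress_m: float, min_ttc_s: float,
                 ttc_threshold_s: float = 2.0,
                 w_progress: float = 1.0, w_ttc: float = 1.0) -> float:
    """Per-agent term: reward progress, penalise TTC below a threshold."""
    ttc_shortfall = max(0.0, ttc_threshold_s - min_ttc_s)
    return w_progress * progress_m - w_ttc * ttc_shortfall


def total_reward(sig: StepSignals, w_div: float = 0.1) -> float:
    """Assumed additive combination of global, per-agent, and diversity terms."""
    per_agent = sum(
        agent_reward(sig.progress_m[name], sig.min_ttc_s[name])
        for name in sig.progress_m
    )
    return global_reward(sig) + per_agent + w_div * sig.diversity
```

Keeping the three terms in separate functions makes it straightforward to ablate one of them by setting its weight to zero.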

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-play structure could be tested on other multi-agent embodied tasks such as manipulation in crowded scenes.
Staying inside the model's latent space might reduce the distribution shift that usually appears when policies move from training to real-world sensors.
Diversity rewards could be combined with real-world data collection loops to continually expand the set of encountered driving scenarios.

Load-bearing premise

The VLA model's latent space can faithfully represent independent multi-step controls for the ego vehicle and traffic agents while capturing their reactive interactions.

What would settle it

If closed-loop evaluations on Bench2Drive or similar benchmarks show that MAPLE-trained models produce no measurable gains in safety, progress, or collision avoidance compared with standard imitation-learning baselines, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.14201 by Deepti Hegde, Fatih Porikli, Hanno Ackermann, Hong Cai, Hsin-Pai Cheng, Litian Liu, Meysam Sadeghigooghari, Mohammad Ghavamzadeh, Pranav Desai, Rajeev Yasarla, Shizhong Han, Yunxiao Shi.

**Figure 1.** MAPLE pretraining and future state prediction. Left: Pretraining the VLA backbone with auxiliary supervision (e.g., map learning, detection, and motion prediction). Right: State-transition pretraining that predicts next-step ego/agent states over a horizon T to stabilize the token space. Multi-agent simulation and self-play. Trajectory forecasting methods [12, 35, 49] model joint agent futures from fixed o…

**Figure 2.** MAPLE supervised fine-tuning (SFT) stage. Left: Single-step supervision and inference. The VLA backbone encodes multi-view images (and map features) into ego and agent tokens, which are decoded by an ego planner, reactive-agent planners, and a motion head. Right: The same model unrolled for T steps during imitation-learning-based scenario rollouts. Predicted tokens/trajectories are fed back autoregressivel…

**Figure 3.** MAPLE RL fine-tuning stage. Starting from the SFT model, we optimize multi-step rollouts over T steps using RL with safety-aware and interaction-aware rewards (e.g., collision avoidance and TTC). progress and safe driving for each controlled agent, and (iii) a diversity reward that promotes distinct behaviors across different planners/policies. At a time t, we define the total rollout reward as Rt = Gt + D…

**Figure 4.** Qualitative examples of closed-loop driving on Bench2Drive using MAPLE. We show representative trajectories in diverse scenarios, including adverse-weather scenes with limited visibility and sudden pedestrian crossings (top row), and clear suburban traffic with dynamic agents such as cyclists and surrounding vehicles (bottom row). Blue curves denote the planned ego-vehicle trajectory, highlighting smooth …

**Figure 5.** BEV qualitative comparison on Bench2Drive (closed-loop). Bird’s-eye-view visualization for the same route/scenario (RouteScenario_25951_rep0, HazardAtSideLaneTwoWays_1, weather_id=7). Left: ReCogDrive [27]. Right: MAPLE (ours). The planned ego trajectory is overlaid, illustrating different interaction outcomes in the same context.

**Figure 6.** Closed-loop rollout comparison on Bench2Drive. Multi-frame qualitative rollouts for the same route/scenario (RouteScenario_25951_rep0, HazardAtSideLaneTwoWays_1, weather_id=7). Top row: ReCogDrive. Bottom row: MAPLE. Colored curves denote the planned ego trajectory across time, highlighting differences in closed-loop interaction behavior. containing dynamic agents (e.g., two cyclists traveling along the r…

**Figure 7.** Additional qualitative closed-loop driving examples on Bench2Drive using MAPLE. These examples include challenging conditions such as low-light/night driving with sudden pedestrian appearances and wet-road reflections, dense fog/highway driving with reduced visibility, and urban scenes with adverse weather. Blue/cyan curves denote the planned ego trajectory.

**Figure 8.** Additional qualitative closed-loop driving examples on Bench2Drive using MAPLE. More examples covering suburban/rural traffic with oncoming vehicles and lane curvature, as well as nighttime intersection scenarios with wet-road conditions and surrounding traffic. Blue/cyan curves denote the planned ego trajectory. gradual curvature and without abrupt corrections

**Figure 9.** Failure case: over-cautious avoidance leading to lane departure (Bench2Drive, closed-loop). In this scenario, MAPLE performs an overly conservative unprotected left turn to avoid a potential collision, resulting in a brief deviation outside the route lanes for about 1.0 meters (1.29% of the full route). The car quickly moves back to the lane after this brief deviation. Blue/cyan curves denote the planned e…

Original abstract

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent -specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real-world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MAPLE puts multi-agent rollouts inside a VLA latent space to enable simulator-free closed-loop training, but the abstract leaves the actual results and validation too thin to judge yet.

MAPLE runs multi-agent rollouts directly in the latent space of a vision-language-action model so the ego vehicle and nearby agents can be controlled independently over multiple steps while staying reactive to each other. This is meant to give closed-loop supervision without calling an external simulator, which the authors flag as expensive and low-fidelity. The training splits into supervised fine-tuning on ground-truth trajectories followed by reinforcement learning that mixes global safety and progress rewards, agent-specific terms, and diversity rewards to push beyond logged data distributions. That specific mix of latent rollouts plus staged SFT-then-RL with diversity incentives is the clearest new element relative to standard imitation or simulator-heavy baselines. If the latent dynamics turn out to be faithful enough, the approach could scale more easily than methods that rely on full physics simulators. The abstract claims state-of-the-art numbers on Bench2Drive, which would matter for the subfield if the gains are real and reproducible.

The main weakness right now is the lack of any reported metrics, baseline tables, or ablation results in the material available. Without those details it is difficult to separate the contribution of the latent multi-agent mechanism from other implementation choices. The central assumption, that the VLA latent space already encodes sufficiently accurate causal interactions for realistic rollouts, also needs direct evidence; if the internal dynamics are off, the RL stage will optimize against a distorted world model and any reported robustness may not carry over.

This is the kind of paper that matters to groups working on end-to-end driving planners who are tired of simulator bottlenecks. A reader already experimenting with latent world models or closed-loop RL for planning would find the framework worth examining even if they end up changing pieces of it. It deserves a serious referee because the problem is real and the proposed direction is concrete enough to get useful technical feedback.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces MAPLE, a two-stage framework for closed-loop training of vision-language-action (VLA) models in end-to-end autonomous driving. Stage 1 performs supervised fine-tuning on multi-step latent rollouts derived from ground-truth trajectories, with the ego vehicle and nearby traffic agents controlled independently yet reactively. Stage 2 applies reinforcement learning using a combination of global, agent-specific, and diversity rewards to promote safety, progress, interaction realism, and behaviors absent from logged data. The approach operates entirely in the VLA latent space without external simulators and reports state-of-the-art results on the Bench2Drive benchmark.

Significance. If the empirical claims hold, the work offers a scalable route to robust closed-loop E2E driving policies by explicitly modeling reactive multi-agent dynamics inside a learned latent space. The two-stage pipeline and the addition of diversity rewards to escape imitation-learning mode collapse constitute a practical contribution that could reduce dependence on expensive, low-fidelity simulators while improving generalization.

major comments (2)

[§3.2] §3.2 (Latent Rollout and RL Stage): The central claim that independent multi-step control of ego and traffic agents inside the VLA latent space yields realistic closed-loop interactions rests on the untested assumption that the latent dynamics encode sufficient causal structure. Without a quantitative fidelity check—such as multi-step prediction error or collision-rate agreement between latent rollouts and held-out simulator trajectories—the RL stage (global + agent-specific + diversity rewards) may optimize against an inaccurate internal world model, rendering the Bench2Drive SOTA result potentially artifactual.
[Table 2] Table 2 (Bench2Drive closed-loop results): The reported SOTA margins are presented without seed-wise variance, statistical significance tests, or an ablation that isolates the contribution of the diversity reward term. If the diversity component yields only marginal gains (as suggested by the modest effect sizes in the reward-ablation rows), the emphasis on generating novel planning behaviors not present in logged data is weakened.
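The fidelity check asked for in the first major comment could take the form of a per-step displacement error between positions decoded from latent rollouts and reference trajectories. The sketch below is one hedged way to compute it. The function name and the assumption that both sources can be expressed as lists of 2D points of equal length are illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def per_step_displacement_error(
    predicted: Sequence[Sequence[Point]],
    reference: Sequence[Sequence[Point]],
) -> List[float]:
    """Mean Euclidean error at each horizon step, averaged over rollouts.

    `predicted[i][t]` is the position decoded from latent rollout i at step t;
    `reference[i][t]` is the matching held-out position.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need the same non-zero number of rollouts in both inputs")
    horizon = len(predicted[0])
    if any(len(p) != horizon or len(r) != horizon for p, r in zip(predicted, reference)):
        raise ValueError("all rollouts must share the same horizon")
    errors = []
    for t in range(horizon):
        dists = [math.dist(p[t], r[t]) for p, r in zip(predicted, reference)]
        errors.append(sum(dists) / len(dists))
    return errors
```

An error curve that grows slowly with the step index would support the claim that the latent dynamics stay faithful over the rollout horizon; a curve that grows quickly would suggest the RL stage is optimizing against a drifting world model.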

minor comments (3)

The abstract contains a minor typographical inconsistency ('agent -specific' with extraneous space).
[Figure 3] Figure 3 (RL fine-tuning stage) would benefit from clearer annotation of reactive events (e.g., arrows indicating agent responses) to help readers verify the claimed interaction realism.
[§2] Related-work section §2 omits several recent VLA driving papers that also explore latent-space planning; adding them would better situate the novelty of the multi-agent rollout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing MAPLE. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.

Referee: [§3.2] §3.2 (Latent Rollout and RL Stage): The central claim that independent multi-step control of ego and traffic agents inside the VLA latent space yields realistic closed-loop interactions rests on the untested assumption that the latent dynamics encode sufficient causal structure. Without a quantitative fidelity check—such as multi-step prediction error or collision-rate agreement between latent rollouts and held-out simulator trajectories—the RL stage (global + agent-specific + diversity rewards) may optimize against an inaccurate internal world model, rendering the Bench2Drive SOTA result potentially artifactual.

Authors: We agree that an explicit quantitative fidelity analysis would provide stronger support for the assumption that the VLA latent space encodes sufficient causal structure for multi-agent interactions. While the closed-loop SOTA results on Bench2Drive (which uses a high-fidelity simulator for evaluation) offer indirect validation that the learned dynamics support effective policy optimization, we acknowledge this does not fully substitute for direct multi-step prediction metrics. In the revision we will add a new subsection with multi-step rollout error analysis and collision-rate comparisons against held-out trajectories to quantify latent dynamics fidelity. revision: yes
Referee: [Table 2] Table 2 (Bench2Drive closed-loop results): The reported SOTA margins are presented without seed-wise variance, statistical significance tests, or an ablation that isolates the contribution of the diversity reward term. If the diversity component yields only marginal gains (as suggested by the modest effect sizes in the reward-ablation rows), the emphasis on generating novel planning behaviors not present in logged data is weakened.

Authors: We appreciate this point on statistical rigor. The current manuscript includes reward ablations but does not report per-seed variance or formal significance testing. We will revise Table 2 to include standard deviations across multiple random seeds and add p-value comparisons for key metrics. For the diversity reward, while the ablation rows show its contribution to escaping mode collapse, we agree the effect sizes merit further emphasis; the revision will expand the ablation table with additional metrics (e.g., behavior novelty scores) and clarify how diversity interacts with the other reward terms to produce behaviors absent from the training distribution. revision: yes
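The seed-wise reporting discussed in the second response might look like the following sketch, which summarizes scores across seeds and computes Welch's t statistic with its Welch-Satterthwaite degrees of freedom. The function names are hypothetical. The sketch stops short of a p-value because the standard library has no t-distribution; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume the two methods have equal variance across seeds.
    """
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two seeds per method")
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    if va + vb == 0.0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Placeholder numbers for illustration only; these are not results from the paper.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With only a handful of seeds the degrees of freedom are small, so even a visibly large mean gap can fail to reach significance; that is the reason to report the spread alongside the mean.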

Circularity Check

0 steps flagged

No circularity in MAPLE derivation chain

The paper presents MAPLE as a two-stage training procedure (supervised fine-tuning on latent rollouts from ground-truth trajectories, followed by RL using global, agent-specific, and diversity rewards) for closed-loop multi-agent control inside a VLA latent space. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the empirical SOTA result on Bench2Drive and the architectural description of independent multi-step control with reactivity; these do not reduce to the inputs by construction and remain externally falsifiable via closed-loop benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that latent-space rollouts can substitute for external simulators in modeling reactive multi-agent driving scenes; no numerical free parameters are specified in the abstract.

axioms (1)

domain assumption: The VLA model's latent space supports accurate multi-step reactive rollouts of ego and traffic agents.
Invoked to enable closed-loop training without external simulators as stated in the abstract.

invented entities (1)

Diversity rewards · no independent evidence
purpose: Encourage generation of planning behaviors absent from logged driving data.
New reward component introduced in the RL stage to promote exploration beyond training distribution.
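As a rough illustration of what a diversity term could measure, the sketch below scores a set of planned trajectories by their mean pairwise distance, capped so that the term cannot be inflated by arbitrarily spreading plans apart. The distance measure, the cap, and the function names are assumptions for illustration; the paper's actual diversity reward is not specified on this page.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float]


def trajectory_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average pointwise Euclidean distance between two equal-length plans."""
    if not a or len(a) != len(b):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def diversity_reward(trajectories: Sequence[Sequence[Point]], cap: float = 5.0) -> float:
    """Mean pairwise distance across planners, clipped at `cap` (assumed value)."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0  # a single plan has nothing to differ from
    mean_dist = sum(trajectory_distance(a, b) for a, b in pairs) / len(pairs)
    return min(mean_dist, cap)
```

A term like this rewards planners for disagreeing with each other, not for being safe or realistic, so it only makes sense alongside the safety and progress rewards.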

pith-pipeline@v0.9.0 · 5802 in / 1383 out tokens · 82176 ms · 2026-05-21T07:45:25.063615+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 := 8; flipAt512; reality_from_one_distinction · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts ... (2) reinforcement learning with global and agent-specific rewards ... diversity rewards ... rollout horizon of T=8 ... NR=8 for reactive-agent planners

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: MAPLE achieves state-of-the-art driving performance on Bench2Drive ... scalable, closed-loop multi-agent play

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 10 internal anchors

[1]

Learning dexterous in-hand manipulation

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020

work page 2020
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

SimNet: Learning reactive self-driving simulations from real-world observations

Luca Bergamini, Yawei Ye, Oliver Scheel, Long Chen, Chih-Yuan Hu, Luca Delévaux, Niels Muller, and Peter Ondruska. SimNet: Learning reactive self-driving simulations from real-world observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

work page 2021
[5]

Killian, Stuart Bowers, Ozan Sener, Philipp Kraehenbuehl, and Vladlen Koltun

Marco Cusumano-Towner, David Hafner, Alexander Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Kraehenbuehl, and Vladlen Koltun. Robust autonomy emerges from self-play. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

work page 2025
[6]

Parting with misconceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023

work page 2023
[7]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

work page 2017
[8]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19358–19369, 2023

work page 2023
[9]

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Co-Reyes, Rishabh Agarwal, Rebecca Roelofs, Yao Lu, Nico Montali, Paul Mougin, Zoey Yang, Brandyn White, Aleksandra Faust, Rowan McAllister, Dragomir Anguelov, and Benjamin Sapp

Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, John D. Co-Reyes, Rishabh Agarwal, Rebecca Roelofs, Yao Lu, Nico Montali, Paul Mougin, Zoey Yang, Brandyn White, Aleksandra Faust, Rowan McAllister, Dragomir Anguelov, and Benjamin Sapp. Waymax: An accelerated, data-driven simulator f...

work page 2023
[11]

Tan et al

K. Tan et al. H. Caesar, J. Kabzan. Nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. InCVPR ADP3 workshop, 2021

work page 2021
[12]

Social force model for pedestrian dynamics.Physical Review E, 51(5): 4282–4286, 1995

Dirk Helbing and Péter Molnár. Social force model for pedestrian dynamics.Physical Review E, 51(5): 4282–4286, 1995

work page 1995
[13]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023

work page 2023
[14]

Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning

Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451. IEEE, 2025

work page 2025
[15]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, and Mingxing Tan. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Carl: Learning scalable planning policies with simple rewards.arXiv preprint arXiv:2504.17838, 2025

Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, and Andreas Geiger. Carl: Learning scalable planning policies with simple rewards.arXiv preprint arXiv:2504.17838, 2025. 10

work page arXiv 2025
[17]

Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. InICCV, 2023

work page 2023
[18]

Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. InCVPR, 2023

work page 2023
[19]

Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS 2024 Datasets and Benchmarks Track, 2024

work page 2024
[20]

Drivetransformer: Unified transformer for scalable end-to-end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[21]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

work page 2023
[22]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques

Peter Karkus, Maximilian Igl, Yuxiao Chen, Kashyap Chitta, Boris Ivanovic, and Marco Pavone. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques. Technical report, NVIDIA Research, 2025

work page 2025
[25]

A survey of generalisation in deep reinforcement learning.arXiv preprint arXiv:2111.09794, 2023

Roberta Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of generalisation in deep reinforcement learning.arXiv preprint arXiv:2111.09794, 2023

work page arXiv 2023
[26]

Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, et al. Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

work page arXiv 2025
[27]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2026

work page 2026
[28]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

work page 2025
[29]

Reinforced refinement with self-aware ex- pansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800, 2025

Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800, 2025

[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

[31] Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023

[32] Davis Rempe, Jonah Philion, Leonidas J Guibas, Sanja Fidler, and Or Litany. Generating useful accident-prone driving scenarios via a learned traffic prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

[33] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[35] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. arXiv preprint arXiv:2209.13508, 2022

[36] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017

[37] Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22432–22441, 2025

[38] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10400–10409, 2021

[39] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25605–25615, 2025

[40] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

[41] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019

[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[43] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 35:6119–6132, 2022

[44] Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

[45] Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, et al. Generative scenario rollouts for end-to-end autonomous driving. arXiv preprint arXiv:2601.11475, 2026

[46] Liuhan Yin, Runkun Ju, Guodong Guo, and Erkang Cheng. Diffrefiner: Coarse to fine trajectory planning via diffusion refinement with semantic interaction for end to end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12009–12017, 2026

[47] Linrui Zhang, Zhenghao Peng, Quanyi Li, and Bolei Zhou. CAT: Closed-loop adversarial training for safe end-to-end driving. In Conference on Robot Learning, 2023

[48] Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. arXiv preprint arXiv:2501.15564, 2025

[49] Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17863–17873, 2023

[50] Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025

A Ablation Study

A.1 Number of Reactive Agents

Agent Distribution in Bench2Drive. To contextualize...

