Scaling Self-Play for End-to-End Driving

Alaap Grandhi; Christopher Pal; Daphne Cornelisse; Eugene Vinitsky; Felix Heide; Liam Paull; Luke Rowe; Rodrigue de Schaetzen; Roger Girgis

arxiv: 2606.19641 · v2 · pith:EPL3X3OAnew · submitted 2026-06-17 · 💻 cs.RO · cs.CV

Scaling Self-Play for End-to-End Driving

Luke Rowe , Roger Girgis , Rodrigue de Schaetzen , Daphne Cornelisse , Alaap Grandhi , Felix Heide , Eugene Vinitsky , Christopher Pal

show 1 more author

Liam Paull

This is my paper

Pith reviewed 2026-06-26 20:20 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords self-playend-to-end drivingautonomous drivingsimulationpolicy distillationsim-to-realreinforcement learning

0 comments

The pith

End-to-end driving policies trained via self-play in a simplified pixel simulator transfer competitively to real sensor data without any human trajectory supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that offline human datasets give end-to-end driving models too little state coverage and no closed-loop feedback, so they suffer from compounding errors and brittle behavior with other agents. It replaces that data source with large-scale self-play inside Gigapixel, a batched simulator that renders a simplified bounding-box world at 50k steps per second and supports direct pixel observations. Policies are trained by distilling from a privileged RL teacher inside the simulator, then adapted to real cameras with lightweight perception fine-tuning. The resulting models reach competitive scores on HUGSIM and NAVSIM-v2, and further self-play training produces proportional gains.

Core claim

Policies trained through self-play DAgger inside Gigapixel and then adapted to real sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision; scaling the amount of self-play training produces proportional improvements in policy performance.

What carries the argument

Gigapixel, a high-throughput batched simulator that renders perspective views of a simplified bounding-box world, paired with self-play DAgger that distills pixel policies from a privileged RL teacher before lightweight perception adaptation to real data.

If this is right

End-to-end driving models can be trained without any human trajectory data.
Closed-loop self-play supplies the state coverage and interaction feedback missing from offline datasets.
Performance of the final policy improves in direct proportion to the amount of self-play training performed.
A simplified non-photorealistic simulator is sufficient for the training stage when followed by perception adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-play-plus-adaptation pipeline could be applied to other robotic control problems that currently rely on human demonstrations.
Further increases in simulator throughput or model scale would be expected to continue raising benchmark scores.
Edge cases that depend on fine visual detail not captured by bounding boxes may still require additional real-world adaptation steps.
Removing the privileged teacher entirely and training the pixel policy directly with RL could test how much the distillation step contributes.

Load-bearing premise

The simplified bounding-box world rendered by Gigapixel preserves enough scene structure that policies trained inside it can be transferred to real sensor data through lightweight perception adaptation.

What would settle it

If increasing the scale of self-play training inside Gigapixel produces no measurable improvement on HUGSIM or NAVSIM-v2 after perception adaptation, or if the adapted policies fall substantially below the performance of human-supervised baselines on those benchmarks.

Figures

Figures reproduced from arXiv: 2606.19641 by Alaap Grandhi, Christopher Pal, Daphne Cornelisse, Eugene Vinitsky, Felix Heide, Liam Paull, Luke Rowe, Rodrigue de Schaetzen, Roger Girgis.

**Figure 1.** Figure 1: Gigapixel Throughput vs. Resolution. Agent steps per second (SPS) across rendering resolutions and policy architectures on 1 NVIDIA A100L GPU. Render Only isolates renderer throughput without policy forward or backward passes. CNN is a simple CNN, and DrivoR [21] is a transformer-based architecture. CNN and DrivoR throughputs are reported in an RL training loop. The gap between the rasterizer (Rast.) … view at source ↗

**Figure 2.** Figure 2: Self-Play for End-to-End Driving. We propose self-play training for end-to-end driving in three stages: (a) a vectorized teacher is trained with self-play RL in an abstract simulator; (b) a pixel-based student is distilled via self-play DAgger in Gigapixel, with all agents student-controlled and trajectory targets generated by parallel teacher rollouts (Sec. 3.2); and (c) the student is adapted to real ima… view at source ↗

**Figure 3.** Figure 3: Self-play DAgger vs. RL. We compare two pixel-based training methods, self-play DAgger and self-play RL, with a CNN-based model in Gigapixel. We plot Gigapixel Driving Score (Completion Rate− Offroad Rate − Collision Rate) on a held-out set as a function of agent steps in Gigapixel. The dashed line marks the vectorized teacher performance at 25B steps. Implementation Details. We train policies in Gigapi… view at source ↗

**Figure 4.** Figure 4: Scaling Experiments. We compare different end-to-end training strategies at scale in Gigapixel. We plot Gigapixel Driving Score (Completion Rate− Offroad Rate − Collision Rate) on a held-out set as a function of agent steps in Gigapixel. In all configurations, we train the single-mode DrivoR-Reg architecture [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of closed-loop driving behavior near a decelerating lead vehicle. Frames advance left to right (t1 → t3); the yellow waypoints denote the model’s planned future trajectory. Top (green): Self-Play Trained DrivoR. As the lead vehicle comes to a stop, the policy reduces speed and plans a smooth lateral trajectory that nudges out of the lane and passes the stationary vehicle safely. Bot… view at source ↗

**Figure 6.** Figure 6: Behavior from a recovery state near the road edge. Frames advance in time (t1 → t3); yellow waypoints denote the planned trajectory. Top (green): Self-Play Trained DrivoR plans a slow, careful turn that corrects toward the lane and keeps the vehicle on the drivable surface. Bottom (red): BC Trained DrivoR fails to correct adequately near the road edge and drifts off-road [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗

**Figure 7.** Figure 7: Ray-traced vs. rasterized Gigapixel renderings. We render the front-view of three scenes with both the Gigapixel ray-tracer and rasterizer across a range of resolutions. The rasterizer achieves higher throughput, but at the cost of simpler shading, as shown in the bottom two rows. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

read the original abstract

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a fast bounding-box simulator and self-play DAgger pipeline for pixel-based driving policies that avoids human data, but the abstract supplies no numbers or ablations to check whether the transfer actually works.

read the letter

The main point is that this work trains end-to-end driving policies through self-play in simulation rather than offline human trajectories. They built Gigapixel to render simplified bounding-box scenes in perspective at 50k steps per second, run self-play DAgger from a privileged teacher, and then do lightweight adaptation to real sensor data, claiming competitive scores on HUGSIM and NAVSIM-v2 plus scaling gains.

What is new is the concrete combination of high-throughput perspective rendering with on-policy distillation for sensor-level policies. Prior self-play work stayed in vectorized BEV, so this pipeline directly targets the end-to-end case while keeping throughput high enough for scaling.

The paper does a clean job stating the core problem: human datasets give limited coverage and no closed-loop signal, so policies compound errors on long-tail interactions. Using simulation self-play to generate that signal is a logical response.

The soft spots sit in the evidence. The abstract states competitive results and proportional scaling but gives no baselines, error bars, or ablation numbers. That leaves the central transfer claim untested on the page. The stress-test concern about missing photometric cues in the bounding-box world is reasonable given the description; without data showing the adaptation recovers enough structure for closed-loop robustness, the claim stays provisional.

This is for people working on simulation scaling and sim-to-real for autonomous driving. A reader already thinking about compute-efficient ways to generate diverse driving data would find the setup useful.

It deserves a serious referee because the motivation and method are coherent and the direction matters if the results hold up. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Gigapixel, a high-throughput batched simulator that renders simplified bounding-box worlds (50k agent steps/s) to support large-scale self-play for pixel-based end-to-end driving policies. It trains via self-play DAgger by distilling from a privileged RL teacher, then applies lightweight perception adaptation to real sensor data. The central claims are that the resulting policies achieve competitive closed-loop performance on the HUGSIM and NAVSIM-v2 benchmarks without any human trajectory supervision, and that increasing self-play training scale produces proportional gains in policy performance.

Significance. If the sim-to-real transfer holds, the result would be significant: it offers a scalable route to end-to-end driving policies that avoids the state-coverage and compounding-error limitations of offline human datasets while using only simplified rendering rather than photorealistic simulation. The reported scaling behavior, if quantitatively documented, would also be a useful empirical finding for the community.

major comments (2)

[Abstract and Gigapixel / transfer description] The headline claim (competitive benchmark performance after transfer) rests on the untested assumption that Gigapixel's bounding-box rendering retains sufficient scene structure for pixel policies once the privileged teacher is removed. The manuscript provides no quantitative ablations or metrics (e.g., closed-loop score degradation when photometric cues are removed, or comparison against a photorealistic baseline) that would demonstrate the domain gap is bridgeable by lightweight adaptation alone, particularly in long-tail interactions.
[Abstract] The abstract asserts 'competitive performance' and 'proportional gains' from scaling but supplies no numerical scores, baselines, error bars, or ablation tables. Without these data in the results section, it is impossible to evaluate whether the self-play DAgger + adaptation pipeline actually surpasses or merely matches existing methods on HUGSIM/NAVSIM-v2.

minor comments (2)

[Methods] Clarify the exact architecture and training details of the 'lightweight perception adaptation' step (e.g., which layers are updated, domain-randomization parameters) so readers can reproduce the sim-to-real bridge.
[Experiments] Define the precise notion of 'scaling self-play training' (number of agents, episodes, or compute) and report the functional form of the performance-vs-scale curve with confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications from the manuscript and indicate revisions where the feedback identifies opportunities to strengthen the presentation of results.

read point-by-point responses

Referee: [Abstract and Gigapixel / transfer description] The headline claim (competitive benchmark performance after transfer) rests on the untested assumption that Gigapixel's bounding-box rendering retains sufficient scene structure for pixel policies once the privileged teacher is removed. The manuscript provides no quantitative ablations or metrics (e.g., closed-loop score degradation when photometric cues are removed, or comparison against a photorealistic baseline) that would demonstrate the domain gap is bridgeable by lightweight adaptation alone, particularly in long-tail interactions.

Authors: The manuscript demonstrates that policies trained via self-play DAgger in the simplified bounding-box renderer achieve competitive closed-loop scores on HUGSIM and NAVSIM-v2 after lightweight perception adaptation, without any human trajectory data. This outcome indicates that the retained geometric structure is sufficient for the pixel policy to learn transferable behaviors from the privileged teacher. We agree that explicit ablations would make this more rigorous and will add quantitative comparisons (e.g., closed-loop degradation when removing specific visual elements in simulation) in the revised manuscript. A direct photorealistic baseline comparison is not feasible within the current experimental scope given the throughput focus, but we will expand the discussion of domain-gap limitations. revision: partial
Referee: [Abstract] The abstract asserts 'competitive performance' and 'proportional gains' from scaling but supplies no numerical scores, baselines, error bars, or ablation tables. Without these data in the results section, it is impossible to evaluate whether the self-play DAgger + adaptation pipeline actually surpasses or merely matches existing methods on HUGSIM/NAVSIM-v2.

Authors: The results section contains the requested numerical benchmark scores, baseline comparisons, error bars across runs, and scaling curves with ablation tables. To improve clarity, we will revise the abstract to incorporate the key quantitative results (specific closed-loop metrics and scaling trends) so that the claims are self-contained and directly supported by the reported data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method evaluated on external benchmarks

full rationale

The paper proposes an empirical training pipeline (self-play DAgger in Gigapixel simulator followed by lightweight perception adaptation) and reports closed-loop performance on HUGSIM and NAVSIM-v2. No derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims reduce to experimental outcomes on independent benchmarks rather than algebraic identities or self-referential definitions. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the domain assumption that simplified bounding-box rendering suffices for transferable policy learning and on the empirical effectiveness of the DAgger distillation step; no free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption Simplified bounding-box rendering preserves essential scene structure for policy transfer.
Explicitly invoked to justify non-photorealistic simulation.

invented entities (1)

Gigapixel simulator no independent evidence
purpose: High-throughput batched driving simulator enabling pixel-based self-play at 50k steps per second.
New tool introduced to support the training method.

pith-pipeline@v0.9.1-grok · 5849 in / 1288 out tokens · 25900 ms · 2026-06-26T20:20:03.213778+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

104 extracted references · 15 linked inside Pith

[1]

P. S. Chib and P. Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey.IEEE Transactions on Intelligent Vehicles, 9(1):103–118, 2023

2023
[2]

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

2024
[3]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

2022
[4]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 17853–17862, 2023

2023
[5]

Jiang, S

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

2023
[6]

Hwang, R

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

Pith/arXiv arXiv 2024
[7]

L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025

arXiv 2025
[8]

Zheng, P

Y . Zheng, P. Yang, Z. Xia, Q. Zhang, Y . Zheng, S. Gu, B. Jin, T. Zhang, B. Lu, C. Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

arXiv 2024
[9]

Naumann, X

A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Muller, and B. Ivanovic. Data scaling laws for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2571– 2582, 2025

2025
[10]

Baniodeh, K

M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe, et al. Scaling laws of motion forecasting and planning–technical report.arXiv preprint arXiv:2506.08228, 2025

arXiv 2025
[11]

Le Mero, D

L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis. A survey on imitation learning tech- niques for end-to-end autonomous vehicles.IEEE Transactions on Intelligent Transportation Systems, 23(9):14128–14147, 2022

2022
[12]

Ljungbergh, A

W. Ljungbergh, A. Tonderski, J. Johnander, H. Caesar, K. Åström, M. Felsberg, and C. Peters- son. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving.European Conference on Computer Vision (ECCV), 2024. 11

2024
[13]

Karkus, M

P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, R. Tian, A. Naumann, G. Garcia-Cobo, S. Tan, et al. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints, 2025

2025
[14]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011
[15]

Codevilla, E

F. Codevilla, E. Santana, A. M. López, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

2019
[16]

Cusumano-Towner, D

M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wij- mans, T. W. Killian, S. Bowers, O. Sener, et al. Robust autonomy emerges from self-play. In International Conference on Machine Learning, pages 11710–11737. PMLR, 2025

2025
[17]

Cornelisse, S

D. Cornelisse, S. Cheng, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URLhttps://github.com/Emerge-Lab/ PufferDrive

2025
[18]

Cornelisse and E

D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning. InReinforcement Learning Conference, 2024

2024
[19]

Cornelisse, A

D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025

arXiv 2025
[20]

Zhang, S

C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision, pages 149–168. Springer, 2024

2024
[21]

Kirby, A

E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, É. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, et al. Driving on registers.arXiv preprint arXiv:2601.05083, 2026

arXiv 2026
[22]

L. G. Rosenzweig, B. Shacklett, W. Xia, and K. Fatahalian. High-throughput batch rendering for embodied ai. InSIGGRAPH Asia 2024 Conference Papers, pages 1–9, 2024

2024
[23]

Agarwal, N

R. Agarwal, N. Vieillard, Y . Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On- policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

2024
[24]

Y . Wang, W. Luo, J. Bai, Y . Cao, T. Che, K. Chen, Y . Chen, J. Diamond, Y . Ding, W. Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Pith/arXiv arXiv 2025
[25]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025
[26]

Jaeger, D

B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. Carl: Learn- ing scalable planning policies with simple rewards.arXiv preprint arXiv:2504.17838, 2025

arXiv 2025
[27]

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 2025. 12

2025
[28]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, et al. Pseudo-simulation for autonomous driving. InConference on Robot Learning, pages 4709–4722. PMLR, 2025

2025
[29]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Bal- dan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

2020
[30]

Caesar, J

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021
[31]

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 2446–2454, 2020

2020
[32]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

arXiv 2025
[33]

H. Arai, K. Miwa, K. Sasaki, K. Watanabe, Y . Yamaguchi, S. Aoki, and I. Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

1933
[34]

Ghilotti, E

F. Ghilotti, E. Palladin, S. Brucker, A. Sigal, M. Bijelic, and F. Heide. Truckdrive: Long-range autonomous highway driving dataset.arXiv preprint arXiv:2603.02413, 2026

arXiv 2026
[35]

X. Weng, B. Ivanovic, Y . Wang, Y . Wang, and M. Pavone. Para-drive: Parallelized archi- tecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

2024
[36]

Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

Pith/arXiv arXiv 2024
[37]

Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

2025
[38]

K. Renz, L. Chen, E. Arani, and O. Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

2025
[39]

Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision- language-action model for end-to-end autonomous driving with adaptive reasoning and re- inforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920– 27956, 2026

2026
[40]

De Haan, D

P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

2019
[41]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–
[42]

B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey.IEEE transactions on intel- ligent transportation systems, 23(6):4909–4926, 2021

2021
[43]

J. Chen, B. Yuan, and M. Tomizuka. Model-free deep reinforcement learning for urban au- tonomous driving. In2019 IEEE intelligent transportation systems conference (ITSC), pages 2765–2771. IEEE, 2019

2019
[44]

Toromanoff, E

M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7153–7162, 2020

2020
[45]

Chekroun, M

R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imita- tion and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

2023
[46]

Zhang, R

C. Zhang, R. Guo, W. Zeng, Y . Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun. Rethinking closed-loop training for autonomous driving. InEuropean Conference on Computer Vision, pages 264–282. Springer, 2022

2022
[47]

H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, X. Li, W. Liu, Q. Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.Ad- vances in Neural Information Processing Systems, 38:32551–32576, 2026

2026
[48]

C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruc- tion.arXiv preprint arXiv:2508.08170, 2025

arXiv 2025
[49]

Zhang, P

Z. Zhang, P. Karkus, M. Igl, W. Ding, Y . Chen, B. Ivanovic, and M. Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5422–5432, 2025

2025
[50]

Zhang and K

J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016

Pith/arXiv arXiv 2016
[51]

Prakash, A

A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger. Exploring data aggregation in policy learning for vision-based urban autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11763–11773, 2020

2020
[52]

Garcia-Cobo, M

G. Garcia-Cobo, M. Igl, P. Karkus, Z. Zhang, M. Watson, Y . Chen, B. Ivanovic, and M. Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of au- tonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

arXiv 2025
[53]

S. Z. Zhao, L. Wang, H. Ruan, Y . Bao, Y . Chen, Z. Leng, A. Ravichandran, H. He, Z. Zhou, X. Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Pith/arXiv arXiv 2026
[54]

Popov, A

A. Popov, A. Degirmenci, D. Wehr, S. Hegde, R. Oldja, A. Kamenev, B. Douillard, D. Nistér, U. Muller, R. Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models.arXiv preprint arXiv:2409.16663, 2024

arXiv 2024
[55]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning, pages 1–16. PMLR, 2017

2017
[56]

Coelho, M

D. Coelho, M. Oliveira, and V . Santos. Rlad: Reinforcement learning from pixels for au- tonomous driving in urban environments.IEEE Transactions on Automation Science and Engineering, 21(4):7427–7435, 2023

2023
[57]

Osi ´nski, A

B. Osi ´nski, A. Jakubowski, P. Zi˛ ecina, P. Miło ´s, C. Galias, S. Homoceanu, and H. Michalewski. Simulation-based reinforcement learning for real-world autonomous driv- ing. In2020 IEEE international conference on robotics and automation (ICRA), pages 6411–
[58]

Gulino, J

C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.Advances in Neural Information Processing Systems, 36, 2023

2023
[59]

Vinitsky, N

E. Vinitsky, N. Lichtlé, X. Yang, B. Amos, and J. Foerster. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world.Advances in Neural Information Processing Systems, 35:3962–3974, 2022

2022
[60]

Kazemkhani, A

S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data- driven, multi-agent driving simulation at 1 million FPS. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

2025
[61]

Wymann, E

B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. Torcs, the open racing car simulator.Software available at http://torcs. sourceforge. net, 4(6):2, 2000

2000
[62]

Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022
[63]

Z. Yang, Y . Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1389–1399, 2023

2023
[64]

Tonderski, C

A. Tonderski, C. Lindström, G. Hess, W. Ljungbergh, L. Svensson, and C. Petersson. Neurad: Neural rendering for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024

2024
[65]

NVIDIA, Y . Cao, R. de Lutio, S. Fidler, G. G. Cobo, Z. Gojcic, M. Igl, B. Ivanovic, P. Karkus, J. M. Esturo, M. Pavone, A. Smith, E. Tanimura, M. Tyszkiewicz, M. Watson, Q. Wu, and L. Zhang. Alpasim: A modular, lightweight, and data-driven research simulator for au- tonomous driving, October 2025. URLhttps://github.com/NVlabs/alpasim

2025
[66]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023
[67]

Russell, A

L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

Pith/arXiv arXiv 2025
[68]

J. Yang, K. Chitta, S. Gao, L. Chen, Y . Shao, X. Jia, H. Li, A. Geiger, X. Yue, and L. Chen. ReSim: Reliable world simulation for autonomous driving.arXiv preprint arXiv:2506.09981, 2025

Pith/arXiv arXiv 2025
[69]

X. Yang, L. Wen, T. Wei, Y . Ma, J. Mei, X. Li, W. Lei, D. Fu, P. Cai, M. Dou, et al. Drivearena: A closed-loop generative simulation platform for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26933–26943, 2025

2025
[70]

T. Yan, D. Wu, W. Han, J. Jiang, X. Zhou, K. Zhan, C.-z. Xu, and J. Shen. Drivingsphere: Building a high-fidelity 4d world for closed-loop simulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27531–27541, 2025

2025
[71]

L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasteriza- tion augmented end-to-end planning. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

2026
[72]

Shacklett, L

B. Shacklett, L. G. Rosenzweig, Z. Xie, B. Sarkar, A. Szot, E. Wijmans, V . Koltun, D. Batra, and K. Fatahalian. An extensible, data-oriented architecture for high-performance, many- world simulation.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023. 15

2023
[73]

H. X. Liu and S. Feng. Curse of rarity for autonomous vehicles.nature communications, 15 (1):4808, 2024

2024
[74]

Silver, T

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017

Pith/arXiv arXiv 2017
[75]

Silver, J

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y . Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

2017
[76]

Jaderberg, W

M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. García Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel. Human-level perfor- mance in first-person multiplayer games with population-based deep reinforcement learning. a...

Pith/arXiv arXiv 2018
[77]

Bakhtin, D

A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. InThe Eleventh International Conference on Learning Representations, ICLR 2023, 2023

2023
[78]

Chang, A

W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. Spacer: Self-play anchoring with centralized reference models.arXiv preprint arXiv:2510.18060, 2025

arXiv 2025
[79]

Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. White- son. Learning to drive in new cities without human demonstrations.arXiv preprint arXiv:2602.15891, 2026

arXiv 2026
[80]

Distelzweig, F

A. Distelzweig, F. Janjoš, A. Look, A. Rothenhäusler, D. Jost, O. Scheel, R. Rajan, D. Cor- nelisse, E. Vinitsky, and J. Boedecker. Beyond self-play and scale: A behavior benchmark for generalization in autonomous driving.arXiv preprint arXiv:2605.10034, 2026

Pith/arXiv arXiv 2026

Showing first 80 references.

[1] [1]

P. S. Chib and P. Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey.IEEE Transactions on Intelligent Vehicles, 9(1):103–118, 2023

2023

[2] [2]

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

2024

[3] [3]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

2022

[4] [4]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 17853–17862, 2023

2023

[5] [5]

Jiang, S

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

2023

[6] [6]

Hwang, R

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

Pith/arXiv arXiv 2024

[7] [7]

L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025

arXiv 2025

[8] [8]

Zheng, P

Y . Zheng, P. Yang, Z. Xia, Q. Zhang, Y . Zheng, S. Gu, B. Jin, T. Zhang, B. Lu, C. Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

arXiv 2024

[9] [9]

Naumann, X

A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Muller, and B. Ivanovic. Data scaling laws for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2571– 2582, 2025

2025

[10] [10]

Baniodeh, K

M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe, et al. Scaling laws of motion forecasting and planning–technical report.arXiv preprint arXiv:2506.08228, 2025

arXiv 2025

[11] [11]

Le Mero, D

L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis. A survey on imitation learning tech- niques for end-to-end autonomous vehicles.IEEE Transactions on Intelligent Transportation Systems, 23(9):14128–14147, 2022

2022

[12] [12]

Ljungbergh, A

W. Ljungbergh, A. Tonderski, J. Johnander, H. Caesar, K. Åström, M. Felsberg, and C. Peters- son. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving.European Conference on Computer Vision (ECCV), 2024. 11

2024

[13] [13]

Karkus, M

P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, R. Tian, A. Naumann, G. Garcia-Cobo, S. Tan, et al. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints, 2025

2025

[14] [14]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011

[15] [15]

Codevilla, E

F. Codevilla, E. Santana, A. M. López, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

2019

[16] [16]

Cusumano-Towner, D

M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wij- mans, T. W. Killian, S. Bowers, O. Sener, et al. Robust autonomy emerges from self-play. In International Conference on Machine Learning, pages 11710–11737. PMLR, 2025

2025

[17] [17]

Cornelisse, S

D. Cornelisse, S. Cheng, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URLhttps://github.com/Emerge-Lab/ PufferDrive

2025

[18] [18]

Cornelisse and E

D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning. InReinforcement Learning Conference, 2024

2024

[19] [19]

Cornelisse, A

D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025

arXiv 2025

[20] [20]

Zhang, S

C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision, pages 149–168. Springer, 2024

2024

[21] [21]

Kirby, A

E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, É. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, et al. Driving on registers.arXiv preprint arXiv:2601.05083, 2026

arXiv 2026

[22] [22]

L. G. Rosenzweig, B. Shacklett, W. Xia, and K. Fatahalian. High-throughput batch rendering for embodied ai. InSIGGRAPH Asia 2024 Conference Papers, pages 1–9, 2024

2024

[23] [23]

Agarwal, N

R. Agarwal, N. Vieillard, Y . Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On- policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

2024

[24] [24]

Y . Wang, W. Luo, J. Bai, Y . Cao, T. Che, K. Chen, Y . Chen, J. Diamond, Y . Ding, W. Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Pith/arXiv arXiv 2025

[25] [25]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025

[26] [26]

Jaeger, D

B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. Carl: Learn- ing scalable planning policies with simple rewards.arXiv preprint arXiv:2504.17838, 2025

arXiv 2025

[27] [27]

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 2025. 12

2025

[28] [28]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, et al. Pseudo-simulation for autonomous driving. InConference on Robot Learning, pages 4709–4722. PMLR, 2025

2025

[29] [29]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Bal- dan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

2020

[30] [30]

Caesar, J

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021

[31] [31]

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 2446–2454, 2020

2020

[32] [32]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

arXiv 2025

[33] [33]

H. Arai, K. Miwa, K. Sasaki, K. Watanabe, Y . Yamaguchi, S. Aoki, and I. Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

1933

[34] [34]

Ghilotti, E

F. Ghilotti, E. Palladin, S. Brucker, A. Sigal, M. Bijelic, and F. Heide. Truckdrive: Long-range autonomous highway driving dataset.arXiv preprint arXiv:2603.02413, 2026

arXiv 2026

[35] [35]

X. Weng, B. Ivanovic, Y . Wang, Y . Wang, and M. Pavone. Para-drive: Parallelized archi- tecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

2024

[36] [36]

Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

Pith/arXiv arXiv 2024

[37] [37]

Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

2025

[38] [38]

K. Renz, L. Chen, E. Arani, and O. Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

2025

[39] [39]

Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision- language-action model for end-to-end autonomous driving with adaptive reasoning and re- inforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920– 27956, 2026

2026

[40] [40]

De Haan, D

P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

2019

[41] [41]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–

[42] [42]

B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey.IEEE transactions on intel- ligent transportation systems, 23(6):4909–4926, 2021

2021

[43] [43]

J. Chen, B. Yuan, and M. Tomizuka. Model-free deep reinforcement learning for urban au- tonomous driving. In2019 IEEE intelligent transportation systems conference (ITSC), pages 2765–2771. IEEE, 2019

2019

[44] [44]

Toromanoff, E

M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7153–7162, 2020

2020

[45] [45]

Chekroun, M

R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imita- tion and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

2023

[46] [46]

Zhang, R

C. Zhang, R. Guo, W. Zeng, Y . Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun. Rethinking closed-loop training for autonomous driving. InEuropean Conference on Computer Vision, pages 264–282. Springer, 2022

2022

[47] [47]

H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, X. Li, W. Liu, Q. Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.Ad- vances in Neural Information Processing Systems, 38:32551–32576, 2026

2026

[48] [48]

C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruc- tion.arXiv preprint arXiv:2508.08170, 2025

arXiv 2025

[49] [49]

Zhang, P

Z. Zhang, P. Karkus, M. Igl, W. Ding, Y . Chen, B. Ivanovic, and M. Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5422–5432, 2025

2025

[50] [50]

Zhang and K

J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016

Pith/arXiv arXiv 2016

[51] [51]

Prakash, A

A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger. Exploring data aggregation in policy learning for vision-based urban autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11763–11773, 2020

2020

[52] [52]

Garcia-Cobo, M

G. Garcia-Cobo, M. Igl, P. Karkus, Z. Zhang, M. Watson, Y . Chen, B. Ivanovic, and M. Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of au- tonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

arXiv 2025

[53] [53]

S. Z. Zhao, L. Wang, H. Ruan, Y . Bao, Y . Chen, Z. Leng, A. Ravichandran, H. He, Z. Zhou, X. Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Pith/arXiv arXiv 2026

[54] [54]

Popov, A

A. Popov, A. Degirmenci, D. Wehr, S. Hegde, R. Oldja, A. Kamenev, B. Douillard, D. Nistér, U. Muller, R. Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models.arXiv preprint arXiv:2409.16663, 2024

arXiv 2024

[55] [55]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning, pages 1–16. PMLR, 2017

2017

[56] [56]

Coelho, M

D. Coelho, M. Oliveira, and V . Santos. Rlad: Reinforcement learning from pixels for au- tonomous driving in urban environments.IEEE Transactions on Automation Science and Engineering, 21(4):7427–7435, 2023

2023

[57] [57]

Osi ´nski, A

B. Osi ´nski, A. Jakubowski, P. Zi˛ ecina, P. Miło ´s, C. Galias, S. Homoceanu, and H. Michalewski. Simulation-based reinforcement learning for real-world autonomous driv- ing. In2020 IEEE international conference on robotics and automation (ICRA), pages 6411–

[58] [58]

Gulino, J

C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.Advances in Neural Information Processing Systems, 36, 2023

2023

[59] [59]

Vinitsky, N

E. Vinitsky, N. Lichtlé, X. Yang, B. Amos, and J. Foerster. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world.Advances in Neural Information Processing Systems, 35:3962–3974, 2022

2022

[60] [60]

Kazemkhani, A

S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data- driven, multi-agent driving simulation at 1 million FPS. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

2025

[61] [61]

Wymann, E

B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. Torcs, the open racing car simulator.Software available at http://torcs. sourceforge. net, 4(6):2, 2000

2000

[62] [62]

Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022

[63] [63]

Z. Yang, Y . Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1389–1399, 2023

2023

[64] [64]

Tonderski, C

A. Tonderski, C. Lindström, G. Hess, W. Ljungbergh, L. Svensson, and C. Petersson. Neurad: Neural rendering for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024

2024

[65] [65]

NVIDIA, Y . Cao, R. de Lutio, S. Fidler, G. G. Cobo, Z. Gojcic, M. Igl, B. Ivanovic, P. Karkus, J. M. Esturo, M. Pavone, A. Smith, E. Tanimura, M. Tyszkiewicz, M. Watson, Q. Wu, and L. Zhang. Alpasim: A modular, lightweight, and data-driven research simulator for au- tonomous driving, October 2025. URLhttps://github.com/NVlabs/alpasim

2025

[66] [66]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023

[67] [67]

Russell, A

L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

Pith/arXiv arXiv 2025

[68] [68]

J. Yang, K. Chitta, S. Gao, L. Chen, Y . Shao, X. Jia, H. Li, A. Geiger, X. Yue, and L. Chen. ReSim: Reliable world simulation for autonomous driving.arXiv preprint arXiv:2506.09981, 2025

Pith/arXiv arXiv 2025

[69] [69]

X. Yang, L. Wen, T. Wei, Y . Ma, J. Mei, X. Li, W. Lei, D. Fu, P. Cai, M. Dou, et al. Drivearena: A closed-loop generative simulation platform for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26933–26943, 2025

2025

[70] [70]

T. Yan, D. Wu, W. Han, J. Jiang, X. Zhou, K. Zhan, C.-z. Xu, and J. Shen. Drivingsphere: Building a high-fidelity 4d world for closed-loop simulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27531–27541, 2025

2025

[71] [71]

L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasteriza- tion augmented end-to-end planning. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

2026

[72] [72]

Shacklett, L

B. Shacklett, L. G. Rosenzweig, Z. Xie, B. Sarkar, A. Szot, E. Wijmans, V . Koltun, D. Batra, and K. Fatahalian. An extensible, data-oriented architecture for high-performance, many- world simulation.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023. 15

2023

[73] [73]

H. X. Liu and S. Feng. Curse of rarity for autonomous vehicles.nature communications, 15 (1):4808, 2024

2024

[74] [74]

Silver, T

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017

Pith/arXiv arXiv 2017

[75] [75]

Silver, J

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y . Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

2017

[76] [76]

Jaderberg, W

M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. García Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel. Human-level perfor- mance in first-person multiplayer games with population-based deep reinforcement learning. a...

Pith/arXiv arXiv 2018

[77] [77]

Bakhtin, D

A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. InThe Eleventh International Conference on Learning Representations, ICLR 2023, 2023

2023

[78] [78]

Chang, A

W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. Spacer: Self-play anchoring with centralized reference models.arXiv preprint arXiv:2510.18060, 2025

arXiv 2025

[79] [79]

Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. White- son. Learning to drive in new cities without human demonstrations.arXiv preprint arXiv:2602.15891, 2026

arXiv 2026

[80] [80]

Distelzweig, F

A. Distelzweig, F. Janjoš, A. Look, A. Rothenhäusler, D. Jost, O. Scheel, R. Rajan, D. Cor- nelisse, E. Vinitsky, and J. Boedecker. Beyond self-play and scale: A behavior benchmark for generalization in autonomous driving.arXiv preprint arXiv:2605.10034, 2026

Pith/arXiv arXiv 2026