pith. sign in

arxiv: 2606.19641 · v2 · pith:EPL3X3OAnew · submitted 2026-06-17 · 💻 cs.RO · cs.CV

Scaling Self-Play for End-to-End Driving

Pith reviewed 2026-06-26 20:20 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords self-playend-to-end drivingautonomous drivingsimulationpolicy distillationsim-to-realreinforcement learning
0
0 comments X

The pith

End-to-end driving policies trained via self-play in a simplified pixel simulator transfer competitively to real sensor data without any human trajectory supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that offline human datasets give end-to-end driving models too little state coverage and no closed-loop feedback, so they suffer from compounding errors and brittle behavior with other agents. It replaces that data source with large-scale self-play inside Gigapixel, a batched simulator that renders a simplified bounding-box world at 50k steps per second and supports direct pixel observations. Policies are trained by distilling from a privileged RL teacher inside the simulator, then adapted to real cameras with lightweight perception fine-tuning. The resulting models reach competitive scores on HUGSIM and NAVSIM-v2, and further self-play training produces proportional gains.

Core claim

Policies trained through self-play DAgger inside Gigapixel and then adapted to real sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision; scaling the amount of self-play training produces proportional improvements in policy performance.

What carries the argument

Gigapixel, a high-throughput batched simulator that renders perspective views of a simplified bounding-box world, paired with self-play DAgger that distills pixel policies from a privileged RL teacher before lightweight perception adaptation to real data.

If this is right

  • End-to-end driving models can be trained without any human trajectory data.
  • Closed-loop self-play supplies the state coverage and interaction feedback missing from offline datasets.
  • Performance of the final policy improves in direct proportion to the amount of self-play training performed.
  • A simplified non-photorealistic simulator is sufficient for the training stage when followed by perception adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-play-plus-adaptation pipeline could be applied to other robotic control problems that currently rely on human demonstrations.
  • Further increases in simulator throughput or model scale would be expected to continue raising benchmark scores.
  • Edge cases that depend on fine visual detail not captured by bounding boxes may still require additional real-world adaptation steps.
  • Removing the privileged teacher entirely and training the pixel policy directly with RL could test how much the distillation step contributes.

Load-bearing premise

The simplified bounding-box world rendered by Gigapixel preserves enough scene structure that policies trained inside it can be transferred to real sensor data through lightweight perception adaptation.

What would settle it

If increasing the scale of self-play training inside Gigapixel produces no measurable improvement on HUGSIM or NAVSIM-v2 after perception adaptation, or if the adapted policies fall substantially below the performance of human-supervised baselines on those benchmarks.

Figures

Figures reproduced from arXiv: 2606.19641 by Alaap Grandhi, Christopher Pal, Daphne Cornelisse, Eugene Vinitsky, Felix Heide, Liam Paull, Luke Rowe, Rodrigue de Schaetzen, Roger Girgis.

Figure 1
Figure 1. Figure 1: Gigapixel Throughput vs. Resolu￾tion. Agent steps per second (SPS) across ren￾dering resolutions and policy architectures on 1 NVIDIA A100L GPU. Render Only isolates ren￾derer throughput without policy forward or back￾ward passes. CNN is a simple CNN, and DrivoR [21] is a transformer-based architecture. CNN and DrivoR throughputs are reported in an RL train￾ing loop. The gap between the rasterizer (Rast.) … view at source ↗
Figure 2
Figure 2. Figure 2: Self-Play for End-to-End Driving. We propose self-play training for end-to-end driving in three stages: (a) a vectorized teacher is trained with self-play RL in an abstract simulator; (b) a pixel-based student is distilled via self-play DAgger in Gigapixel, with all agents student-controlled and trajectory targets generated by parallel teacher rollouts (Sec. 3.2); and (c) the student is adapted to real ima… view at source ↗
Figure 3
Figure 3. Figure 3: Self-play DAgger vs. RL. We compare two pixel-based training methods, self-play DAgger and self-play RL, with a CNN-based model in Gigapixel. We plot Gi￾gapixel Driving Score (Completion Rate− Offroad Rate − Collision Rate) on a held-out set as a function of agent steps in Gigapixel. The dashed line marks the vector￾ized teacher performance at 25B steps. Implementation Details. We train policies in Gi￾gapi… view at source ↗
Figure 4
Figure 4. Figure 4: Scaling Experiments. We com￾pare different end-to-end training strate￾gies at scale in Gigapixel. We plot Gi￾gapixel Driving Score (Completion Rate− Offroad Rate − Collision Rate) on a held-out set as a function of agent steps in Gigapixel. In all configurations, we train the single-mode DrivoR-Reg architecture [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of closed-loop driving behavior near a decelerating lead ve￾hicle. Frames advance left to right (t1 → t3); the yellow waypoints denote the model’s planned future trajectory. Top (green): Self-Play Trained DrivoR. As the lead vehicle comes to a stop, the policy reduces speed and plans a smooth lateral trajectory that nudges out of the lane and passes the stationary vehicle safely. Bot… view at source ↗
Figure 6
Figure 6. Figure 6: Behavior from a recovery state near the road edge. Frames advance in time (t1 → t3); yellow waypoints denote the planned trajectory. Top (green): Self-Play Trained DrivoR plans a slow, careful turn that corrects toward the lane and keeps the vehicle on the drivable surface. Bottom (red): BC Trained DrivoR fails to correct adequately near the road edge and drifts off-road [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗
Figure 7
Figure 7. Figure 7: Ray-traced vs. rasterized Gigapixel renderings. We render the front-view of three scenes with both the Gigapixel ray-tracer and rasterizer across a range of resolutions. The rasterizer achieves higher throughput, but at the cost of simpler shading, as shown in the bottom two rows. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
read the original abstract

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Gigapixel, a high-throughput batched simulator that renders simplified bounding-box worlds (50k agent steps/s) to support large-scale self-play for pixel-based end-to-end driving policies. It trains via self-play DAgger by distilling from a privileged RL teacher, then applies lightweight perception adaptation to real sensor data. The central claims are that the resulting policies achieve competitive closed-loop performance on the HUGSIM and NAVSIM-v2 benchmarks without any human trajectory supervision, and that increasing self-play training scale produces proportional gains in policy performance.

Significance. If the sim-to-real transfer holds, the result would be significant: it offers a scalable route to end-to-end driving policies that avoids the state-coverage and compounding-error limitations of offline human datasets while using only simplified rendering rather than photorealistic simulation. The reported scaling behavior, if quantitatively documented, would also be a useful empirical finding for the community.

major comments (2)
  1. [Abstract and Gigapixel / transfer description] The headline claim (competitive benchmark performance after transfer) rests on the untested assumption that Gigapixel's bounding-box rendering retains sufficient scene structure for pixel policies once the privileged teacher is removed. The manuscript provides no quantitative ablations or metrics (e.g., closed-loop score degradation when photometric cues are removed, or comparison against a photorealistic baseline) that would demonstrate the domain gap is bridgeable by lightweight adaptation alone, particularly in long-tail interactions.
  2. [Abstract] The abstract asserts 'competitive performance' and 'proportional gains' from scaling but supplies no numerical scores, baselines, error bars, or ablation tables. Without these data in the results section, it is impossible to evaluate whether the self-play DAgger + adaptation pipeline actually surpasses or merely matches existing methods on HUGSIM/NAVSIM-v2.
minor comments (2)
  1. [Methods] Clarify the exact architecture and training details of the 'lightweight perception adaptation' step (e.g., which layers are updated, domain-randomization parameters) so readers can reproduce the sim-to-real bridge.
  2. [Experiments] Define the precise notion of 'scaling self-play training' (number of agents, episodes, or compute) and report the functional form of the performance-vs-scale curve with confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications from the manuscript and indicate revisions where the feedback identifies opportunities to strengthen the presentation of results.

read point-by-point responses
  1. Referee: [Abstract and Gigapixel / transfer description] The headline claim (competitive benchmark performance after transfer) rests on the untested assumption that Gigapixel's bounding-box rendering retains sufficient scene structure for pixel policies once the privileged teacher is removed. The manuscript provides no quantitative ablations or metrics (e.g., closed-loop score degradation when photometric cues are removed, or comparison against a photorealistic baseline) that would demonstrate the domain gap is bridgeable by lightweight adaptation alone, particularly in long-tail interactions.

    Authors: The manuscript demonstrates that policies trained via self-play DAgger in the simplified bounding-box renderer achieve competitive closed-loop scores on HUGSIM and NAVSIM-v2 after lightweight perception adaptation, without any human trajectory data. This outcome indicates that the retained geometric structure is sufficient for the pixel policy to learn transferable behaviors from the privileged teacher. We agree that explicit ablations would make this more rigorous and will add quantitative comparisons (e.g., closed-loop degradation when removing specific visual elements in simulation) in the revised manuscript. A direct photorealistic baseline comparison is not feasible within the current experimental scope given the throughput focus, but we will expand the discussion of domain-gap limitations. revision: partial

  2. Referee: [Abstract] The abstract asserts 'competitive performance' and 'proportional gains' from scaling but supplies no numerical scores, baselines, error bars, or ablation tables. Without these data in the results section, it is impossible to evaluate whether the self-play DAgger + adaptation pipeline actually surpasses or merely matches existing methods on HUGSIM/NAVSIM-v2.

    Authors: The results section contains the requested numerical benchmark scores, baseline comparisons, error bars across runs, and scaling curves with ablation tables. To improve clarity, we will revise the abstract to incorporate the key quantitative results (specific closed-loop metrics and scaling trends) so that the claims are self-contained and directly supported by the reported data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method evaluated on external benchmarks

full rationale

The paper proposes an empirical training pipeline (self-play DAgger in Gigapixel simulator followed by lightweight perception adaptation) and reports closed-loop performance on HUGSIM and NAVSIM-v2. No derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims reduce to experimental outcomes on independent benchmarks rather than algebraic identities or self-referential definitions. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the domain assumption that simplified bounding-box rendering suffices for transferable policy learning and on the empirical effectiveness of the DAgger distillation step; no free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption Simplified bounding-box rendering preserves essential scene structure for policy transfer.
    Explicitly invoked to justify non-photorealistic simulation.
invented entities (1)
  • Gigapixel simulator no independent evidence
    purpose: High-throughput batched driving simulator enabling pixel-based self-play at 50k steps per second.
    New tool introduced to support the training method.

pith-pipeline@v0.9.1-grok · 5849 in / 1288 out tokens · 25900 ms · 2026-06-26T20:20:03.213778+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

104 extracted references · 15 linked inside Pith

  1. [1]

    P. S. Chib and P. Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey.IEEE Transactions on Intelligent Vehicles, 9(1):103–118, 2023

  2. [2]

    L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

  3. [3]

    Chitta, A

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

  4. [4]

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 17853–17862, 2023

  5. [5]

    Jiang, S

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  6. [6]

    Hwang, R

    J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  7. [7]

    L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025

  8. [8]

    Zheng, P

    Y . Zheng, P. Yang, Z. Xia, Q. Zhang, Y . Zheng, S. Gu, B. Jin, T. Zhang, B. Lu, C. Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

  9. [9]

    Naumann, X

    A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Muller, and B. Ivanovic. Data scaling laws for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2571– 2582, 2025

  10. [10]

    Baniodeh, K

    M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe, et al. Scaling laws of motion forecasting and planning–technical report.arXiv preprint arXiv:2506.08228, 2025

  11. [11]

    Le Mero, D

    L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis. A survey on imitation learning tech- niques for end-to-end autonomous vehicles.IEEE Transactions on Intelligent Transportation Systems, 23(9):14128–14147, 2022

  12. [12]

    Ljungbergh, A

    W. Ljungbergh, A. Tonderski, J. Johnander, H. Caesar, K. Åström, M. Felsberg, and C. Peters- son. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving.European Conference on Computer Vision (ECCV), 2024. 11

  13. [13]

    Karkus, M

    P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, R. Tian, A. Naumann, G. Garcia-Cobo, S. Tan, et al. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints, 2025

  14. [14]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

  15. [15]

    Codevilla, E

    F. Codevilla, E. Santana, A. M. López, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

  16. [16]

    Cusumano-Towner, D

    M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wij- mans, T. W. Killian, S. Bowers, O. Sener, et al. Robust autonomy emerges from self-play. In International Conference on Machine Learning, pages 11710–11737. PMLR, 2025

  17. [17]

    Cornelisse, S

    D. Cornelisse, S. Cheng, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URLhttps://github.com/Emerge-Lab/ PufferDrive

  18. [18]

    Cornelisse and E

    D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning. InReinforcement Learning Conference, 2024

  19. [19]

    Cornelisse, A

    D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025

  20. [20]

    Zhang, S

    C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision, pages 149–168. Springer, 2024

  21. [21]

    Kirby, A

    E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, É. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, et al. Driving on registers.arXiv preprint arXiv:2601.05083, 2026

  22. [22]

    L. G. Rosenzweig, B. Shacklett, W. Xia, and K. Fatahalian. High-throughput batch rendering for embodied ai. InSIGGRAPH Asia 2024 Conference Papers, pages 1–9, 2024

  23. [23]

    Agarwal, N

    R. Agarwal, N. Vieillard, Y . Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On- policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  24. [24]

    Y . Wang, W. Luo, J. Bai, Y . Cao, T. Che, K. Chen, Y . Chen, J. Diamond, Y . Ding, W. Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  25. [25]

    B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  26. [26]

    Jaeger, D

    B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. Carl: Learn- ing scalable planning policies with simple rewards.arXiv preprint arXiv:2504.17838, 2025

  27. [27]

    H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 2025. 12

  28. [28]

    W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, et al. Pseudo-simulation for autonomous driving. InConference on Robot Learning, pages 4709–4722. PMLR, 2025

  29. [29]

    Caesar, V

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Bal- dan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  30. [30]

    Caesar, J

    H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  31. [31]

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 2446–2454, 2020

  32. [32]

    R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

  33. [33]

    H. Arai, K. Miwa, K. Sasaki, K. Watanabe, Y . Yamaguchi, S. Aoki, and I. Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  34. [34]

    Ghilotti, E

    F. Ghilotti, E. Palladin, S. Brucker, A. Sigal, M. Bijelic, and F. Heide. Truckdrive: Long-range autonomous highway driving dataset.arXiv preprint arXiv:2603.02413, 2026

  35. [35]

    X. Weng, B. Ivanovic, Y . Wang, Y . Wang, and M. Pavone. Para-drive: Parallelized archi- tecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

  36. [36]

    Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  37. [37]

    Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

  38. [38]

    K. Renz, L. Chen, E. Arani, and O. Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

  39. [39]

    Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision- language-action model for end-to-end autonomous driving with adaptive reasoning and re- inforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920– 27956, 2026

  40. [40]

    De Haan, D

    P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

  41. [41]

    Dauner, M

    D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–

  42. [42]

    B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey.IEEE transactions on intel- ligent transportation systems, 23(6):4909–4926, 2021

  43. [43]

    J. Chen, B. Yuan, and M. Tomizuka. Model-free deep reinforcement learning for urban au- tonomous driving. In2019 IEEE intelligent transportation systems conference (ITSC), pages 2765–2771. IEEE, 2019

  44. [44]

    Toromanoff, E

    M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7153–7162, 2020

  45. [45]

    Chekroun, M

    R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imita- tion and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

  46. [46]

    Zhang, R

    C. Zhang, R. Guo, W. Zeng, Y . Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun. Rethinking closed-loop training for autonomous driving. InEuropean Conference on Computer Vision, pages 264–282. Springer, 2022

  47. [47]

    H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, X. Li, W. Liu, Q. Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.Ad- vances in Neural Information Processing Systems, 38:32551–32576, 2026

  48. [48]

    C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruc- tion.arXiv preprint arXiv:2508.08170, 2025

  49. [49]

    Zhang, P

    Z. Zhang, P. Karkus, M. Igl, W. Ding, Y . Chen, B. Ivanovic, and M. Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5422–5432, 2025

  50. [50]

    Zhang and K

    J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016

  51. [51]

    Prakash, A

    A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger. Exploring data aggregation in policy learning for vision-based urban autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11763–11773, 2020

  52. [52]

    Garcia-Cobo, M

    G. Garcia-Cobo, M. Igl, P. Karkus, Z. Zhang, M. Watson, Y . Chen, B. Ivanovic, and M. Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of au- tonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

  53. [53]

    S. Z. Zhao, L. Wang, H. Ruan, Y . Bao, Y . Chen, Z. Leng, A. Ravichandran, H. He, Z. Zhou, X. Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

  54. [54]

    Popov, A

    A. Popov, A. Degirmenci, D. Wehr, S. Hegde, R. Oldja, A. Kamenev, B. Douillard, D. Nistér, U. Muller, R. Bhargava, et al. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models.arXiv preprint arXiv:2409.16663, 2024

  55. [55]

    Dosovitskiy, G

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning, pages 1–16. PMLR, 2017

  56. [56]

    Coelho, M

    D. Coelho, M. Oliveira, and V . Santos. Rlad: Reinforcement learning from pixels for au- tonomous driving in urban environments.IEEE Transactions on Automation Science and Engineering, 21(4):7427–7435, 2023

  57. [57]

    Osi ´nski, A

    B. Osi ´nski, A. Jakubowski, P. Zi˛ ecina, P. Miło ´s, C. Galias, S. Homoceanu, and H. Michalewski. Simulation-based reinforcement learning for real-world autonomous driv- ing. In2020 IEEE international conference on robotics and automation (ICRA), pages 6411–

  58. [58]

    Gulino, J

    C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.Advances in Neural Information Processing Systems, 36, 2023

  59. [59]

    Vinitsky, N

    E. Vinitsky, N. Lichtlé, X. Yang, B. Amos, and J. Foerster. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world.Advances in Neural Information Processing Systems, 35:3962–3974, 2022

  60. [60]

    Kazemkhani, A

    S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data- driven, multi-agent driving simulation at 1 million FPS. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

  61. [61]

    Wymann, E

    B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. Torcs, the open racing car simulator.Software available at http://torcs. sourceforge. net, 4(6):2, 2000

  62. [62]

    Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

  63. [63]

    Z. Yang, Y . Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1389–1399, 2023

  64. [64]

    Tonderski, C

    A. Tonderski, C. Lindström, G. Hess, W. Ljungbergh, L. Svensson, and C. Petersson. Neurad: Neural rendering for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024

  65. [65]

    NVIDIA, Y . Cao, R. de Lutio, S. Fidler, G. G. Cobo, Z. Gojcic, M. Igl, B. Ivanovic, P. Karkus, J. M. Esturo, M. Pavone, A. Smith, E. Tanimura, M. Tyszkiewicz, M. Watson, Q. Wu, and L. Zhang. Alpasim: A modular, lightweight, and data-driven research simulator for au- tonomous driving, October 2025. URLhttps://github.com/NVlabs/alpasim

  66. [66]

    A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  67. [67]

    Russell, A

    L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

  68. [68]

    J. Yang, K. Chitta, S. Gao, L. Chen, Y . Shao, X. Jia, H. Li, A. Geiger, X. Yue, and L. Chen. ReSim: Reliable world simulation for autonomous driving.arXiv preprint arXiv:2506.09981, 2025

  69. [69]

    X. Yang, L. Wen, T. Wei, Y . Ma, J. Mei, X. Li, W. Lei, D. Fu, P. Cai, M. Dou, et al. Drivearena: A closed-loop generative simulation platform for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26933–26943, 2025

  70. [70]

    T. Yan, D. Wu, W. Han, J. Jiang, X. Zhou, K. Zhan, C.-z. Xu, and J. Shen. Drivingsphere: Building a high-fidelity 4d world for closed-loop simulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27531–27541, 2025

  71. [71]

    L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasteriza- tion augmented end-to-end planning. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

  72. [72]

    Shacklett, L

    B. Shacklett, L. G. Rosenzweig, Z. Xie, B. Sarkar, A. Szot, E. Wijmans, V . Koltun, D. Batra, and K. Fatahalian. An extensible, data-oriented architecture for high-performance, many- world simulation.ACM Transactions on Graphics (TOG), 42(4):1–13, 2023. 15

  73. [73]

    H. X. Liu and S. Feng. Curse of rarity for autonomous vehicles.nature communications, 15 (1):4808, 2024

  74. [74]

    Silver, T

    D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017

  75. [75]

    Silver, J

    D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y . Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

  76. [76]

    Jaderberg, W

    M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. García Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel. Human-level perfor- mance in first-person multiplayer games with population-based deep reinforcement learning. a...

  77. [77]

    Bakhtin, D

    A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. InThe Eleventh International Conference on Learning Representations, ICLR 2023, 2023

  78. [78]

    Chang, A

    W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. Spacer: Self-play anchoring with centralized reference models.arXiv preprint arXiv:2510.18060, 2025

  79. [79]

    Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. White- son. Learning to drive in new cities without human demonstrations.arXiv preprint arXiv:2602.15891, 2026

  80. [80]

    Distelzweig, F

    A. Distelzweig, F. Janjoš, A. Look, A. Rothenhäusler, D. Jost, O. Scheel, R. Rajan, D. Cor- nelisse, E. Vinitsky, and J. Boedecker. Beyond self-play and scale: A behavior benchmark for generalization in autonomous driving.arXiv preprint arXiv:2605.10034, 2026

Showing first 80 references.