TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Akshay Rangesh; Chen Tang; Grantland Hall; Saarth Bonde; Weixin Li; Wei Zhan; Yihan Hu; Zhouchonghao Wu; Zikang Xiong

arXiv: 2606.17386 · v1 · pith:PONEVWXWnew · submitted 2026-06-16 · cs.CV · cs.AI · cs.RO

Pith reviewed 2026-06-27 02:23 UTC · model grok-4.3

keywords: end-to-end driving, self-play, policy transfer, autonomous driving, latent alignment, no expert demonstrations, 3D Gaussian splatting, reinforcement learning

The pith

Self-play in vectorized simulators enables end-to-end image driving policies without expert demonstrations via latent space alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that end-to-end autonomous driving from images can be learned without any expert trajectory data. It first pretrains a policy through self-play inside a fast vectorized simulator that produces millions of rollout steps per second, including collisions and recoveries. The policy is then transferred to a vision backbone by aligning latent representations, using an action KL divergence against the self-play outputs plus a batch-relational low-rank structural loss on paired image and scene-state frames. This removes the need for curated expert demonstrations: the transfer step needs only paired image and scene-state frames, not labeled expert trajectories. On closed-loop tests with photorealistic 3D Gaussian splatting renderings, the resulting policy matches or exceeds earlier end-to-end methods that relied on expert supervision.

Core claim

By pretraining a policy exclusively through self-play in a vectorized simulator and then aligning its latent space to a vision backbone with action KL divergence and batch-relational low-rank structural loss on paired image and scene-state data, the approach produces an end-to-end image-based driving policy that requires no expert trajectory supervision and achieves performance matching or exceeding prior methods on photorealistic closed-loop scenarios.

What carries the argument

The latent space alignment process that transfers self-play behavior to an image-based model using action KL divergence and batch-relational low-rank structural loss.
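The two alignment terms named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the action space is taken to be discrete, the KL direction is teacher-to-student, and the batch-relational term is written as a plain cosine-similarity Gram-matrix match with no low-rank truncation, since the source names the loss without defining it. All function names and the `weight` parameter are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) over a discrete action set (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def normalized_gram(latents):
    """Batch-by-batch cosine-similarity matrix of the latent vectors."""
    unit = []
    for z in latents:
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        unit.append([v / norm for v in z])
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in unit] for zi in unit]


def relational_loss(teacher_latents, student_latents):
    """Mean squared difference between the two batch similarity matrices.

    Assumption: a stand-in for the batch-relational structural loss;
    the low-rank step is intentionally omitted.
    """
    g_t = normalized_gram(teacher_latents)
    g_s = normalized_gram(student_latents)
    b = len(g_t)
    return sum(
        (g_t[i][j] - g_s[i][j]) ** 2 for i in range(b) for j in range(b)
    ) / (b * b)


def alignment_loss(teacher_logits, student_logits,
                   teacher_latents, student_latents, weight=1.0):
    """Batch-mean action KL plus a weighted relational term (hypothetical weighting)."""
    kl = sum(
        action_kl(t, s) for t, s in zip(teacher_logits, student_logits)
    ) / len(teacher_logits)
    return kl + weight * relational_loss(teacher_latents, student_latents)
```

In this sketch the teacher inputs would come from the self-play policy evaluated on scene states and the student inputs from the vision model evaluated on the paired images. Comparing similarity matrices rather than raw latents means the two latent spaces do not need to share a dimension.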

If this is right

End-to-end policies can be trained without the cost of collecting and labeling expert driving demonstrations.
Self-play supplies state distributions rich in collisions and recoveries absent from logged data.
Pretrained vision backbones can be used directly for image inputs after alignment.
Only paired image and scene-state frames are required for the transfer step.
Closed-loop performance on photorealistic renderings reaches or surpasses imitation-based baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decoupling of policy learning from vision learning could apply to other simulation-heavy robotics tasks.
Similar alignment techniques might reduce expert data needs in other vision-based control domains.
Extending self-play to more complex vectorized environments could improve transfer quality.
The batch-relational loss may help preserve structural consistency across diverse driving situations.

Load-bearing premise

That aligning the self-play policy's latent space to a vision backbone via action KL divergence and batch-relational low-rank structural loss will produce a functional image-based driving policy without expert trajectory supervision.

What would settle it

If the aligned image-based policy shows markedly worse closed-loop success rates than prior expert-supervised methods on the same photorealistic 3D Gaussian splatting scenarios, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2606.17386 by Akshay Rangesh, Chen Tang, Grantland Hall, Saarth Bonde, Weixin Li, Wei Zhan, Yihan Hu, Zhouchonghao Wu, Zikang Xiong.

**Figure 1.** Conventional vs. proposed training paradigm. (a) Conventional recipes begin with imitation pretraining on fleet-scale logs, then add supervised fine-tuning, open-loop RL on logged trajectories, or closed-loop image RL in sensor simulators, each path requiring expensive human-driving data or photorealistic rendering. (b) Our two-phase paradigm decouples learning to drive from learning to see: Phase 1 trains…

**Figure 2.** Two-phase training pipeline. Flame icons mark trainable modules and snowflakes mark frozen ones; red arrows carry gradients and blue arrows do not. Phase 1 (left): A single self-play vector policy is trained end-to-end with PPO in a multi-agent vectorized simulator. The ego, map, and partner encoders, and the action head are jointly optimized, and the same parameter set controls every agent in the scene so…

**Figure 3.** Alignment data efficiency. Closed-loop HD-Score vs. relative nuPlan training data ρ (our full alignment set = 1.83M frames ⇒ ρ = 1), for the All set (top) and each HUGSim tier (bottom) on nuScenes; bands are ±1 per-scene std. The self-play teacher (§4.2) uses no nuPlan data and sits at ρ = 0 (horizontal dashed line) in every panel; ECO Smoothing-only is placed at ρ ≈ 1.6 (∼2.8M nuPlan frames vs. our 1.83M…

**Figure 5.** Modal coverage: ensemble-averaged squared projection of the batch-derived top-…

**Figure 6.** Fraction of perturbation energy preserved in the projected subspace as a function of batch…

**Figure 7.** Front-camera samples from decoupled alignment data. A grid of CAM F0 observations collected from HUGSim rollouts. These frames provide paired visual observations and reconstructed scene states for the alignment loss; their actions are not used as expert demonstrations, so rollout quality affects alignment only through the state coverage induced by the data-collection policy…

**Figure 8.** Cut-in negotiation. The ego initially travels at low speed while monitoring a vehicle that could cut in. Once it confirms the vehicle has no cut-in intention (center), the ego accelerates and overtakes smoothly (right). The speed profile shows a brief deceleration followed by a sustained ramp-up, demonstrating that the policy withholds commitment until the other agent’s intent is resolved.

**Figure 9.** Overtaking with oncoming traffic. A static vehicle blocks the lane while a slow oncoming car approaches. The policy decides to overtake (center), accelerates past the static vehicle, and merges back into the original lane before the oncoming car arrives (right). The speed profile rises throughout the maneuver, reflecting a committed overtake decision made under time pressure.

**Figure 10.** Lead vehicle yielding route. The policy follows a slow lead vehicle at a steady speed (left). When the lead vehicle diverges from the ego’s route (center), the ego turns onto its own path and accelerates (right). The speed profile is flat during the follow phase and then climbs sharply after the turn, showing the policy correctly disengages from the lead once it is no longer relevant.

**Figure 11.** Narrow-lane passage. In a narrow lane, the policy slows when approaching an oncoming vehicle (left), holds a reduced speed while passing (center), and then begins decelerating again in anticipation of a vulnerable road user detected ahead (right, red box). The speed profile oscillates rather than recovering fully, reflecting the policy’s forward-looking awareness of the downstream hazard.

**Figure 12.** Visually implausible extreme-tier scenarios. Extreme scenarios in the HUGSim benchmark can be visually unrealistic, with implausible initial configurations, or an occluded environment where other cars can pass through each other.
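Figure 2 describes Phase 1 as PPO training of a single policy whose parameters control every agent in the scene. A minimal sketch of the standard PPO clipped surrogate, applied to experience pooled across agents, might look like the following. The function names, the pooling helper, and the default `clip_eps=0.2` are illustrative assumptions; the paper's reward, advantage estimation, and hyperparameters are not given on this page.

```python
import math


def pool_agents(per_agent_values):
    """Flatten {agent_id: [per-step values]} into one batch.

    With a shared parameter set, every agent's experience can feed the
    same update, so the per-agent lists are simply concatenated.
    """
    return [
        value
        for agent_id in sorted(per_agent_values)
        for value in per_agent_values[agent_id]
    ]


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate, averaged over the pooled batch."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated and the behavior policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate values.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

Pooling before the loss is what makes this self-play in the shared-parameter sense: one gradient step improves the policy that every agent in the scene is running.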

Original abstract

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's idea of skipping expert demos via vector self-play pretraining plus latent alignment to vision is a reasonable attempt to cut data costs, but the domain gap between simulators looks like it could undermine the closed-loop claims.

The main thing here is a method that pretrains a driving policy entirely through self-play in a cheap vectorized simulator, then aligns its latent space to a vision backbone so the final model can take images as input. Alignment uses action KL divergence from the self-play policy plus a batch-relational low-rank structural loss on paired image-state frames. No expert trajectories are used at any stage, and the claim is that the resulting policy matches or beats prior end-to-end methods when tested in photorealistic 3D Gaussian splatting closed-loop scenarios.

What works is the basic separation of concerns. Self-play in vector sims is fast and naturally produces collisions and recoveries that logged data rarely shows. Using the self-play policy itself as the action teacher during alignment avoids the usual imitation-learning dependence on curated human demonstrations. That part directly targets the expense problem mentioned in the abstract.

The soft spot is the transfer step. The self-play policy operates on vector states, the alignment pairs presumably come from rendering those same states, and evaluation happens in a separate photorealistic environment with different visuals, geometry, and possibly dynamics. Nothing in the described losses penalizes mismatches across those domains, so the policy could learn behaviors that only make sense under vector assumptions. The stress-test note on this point holds up from the abstract alone; without ablations or details on pair generation, it is hard to see why the closed-loop results should be trusted.

This is for researchers focused on scaling end-to-end driving training without massive data pipelines. Readers already working on self-play or latent alignment might pick up the specific loss combination. If the full paper supplies controlled experiments that address the domain gap and show the alignment actually transfers, it deserves a serious referee. Otherwise the central performance claim rests on an untested assumption.

Referee Report

2 major / 1 minor

Summary. The paper presents TerraTransfer, a method to learn end-to-end driving policies without expert demonstrations. A policy is first pretrained via self-play in a vectorized simulator; its latent space is then aligned to a pretrained vision backbone using action KL divergence together with a batch-relational low-rank structural loss on paired (image, scene-state) frames. The resulting image-based policy is evaluated in closed-loop on photorealistic 3D Gaussian splatting scenarios, where it is claimed to match or exceed prior end-to-end methods.

Significance. If the central claim holds, the work would materially lower the cost of training end-to-end driving policies by exploiting the speed and rich failure-state distribution of vectorized self-play while removing the need for curated expert trajectories. The explicit decoupling of policy learning from perception learning is a substantive conceptual contribution that could generalize beyond driving.

major comments (2)

[Abstract / §3 (alignment)] The alignment procedure (described in the abstract and presumably §3) uses paired (image, scene-state) frames and action KL divergence, yet the manuscript does not state whether the scene-states are generated inside the vectorized simulator or inside the target Gaussian-splatting environment. If the former, the domain gap in dynamics, geometry, and physics is never directly penalized and remains load-bearing for the claim that the transferred policy functions in photorealistic closed-loop rollouts.
[§4] §4 (closed-loop evaluation): the claim that the policy “matches or exceeds prior end-to-end methods” is presented without reported trial counts, variance, or statistical tests against the cited baselines; a single qualitative statement is insufficient to support the performance conclusion that justifies the entire pipeline.

minor comments (1)

[Abstract / §3] The batch-relational low-rank structural loss is named but not defined or referenced; a short equation or citation would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to improve clarity and rigor.

Referee: [Abstract / §3 (alignment)] The alignment procedure (described in the abstract and presumably §3) uses paired (image, scene-state) frames and action KL divergence, yet the manuscript does not state whether the scene-states are generated inside the vectorized simulator or inside the target Gaussian-splatting environment. If the former, the domain gap in dynamics, geometry, and physics is never directly penalized and remains load-bearing for the claim that the transferred policy functions in photorealistic closed-loop rollouts.

Authors: We agree that the source of the scene-states must be stated explicitly. The paired (image, scene-state) frames are constructed by rendering images from the 3D Gaussian splatting environment while obtaining the corresponding scene-states from the vectorized simulator for the same underlying scene configuration. We will revise §3 to state this construction clearly and add a short discussion of how the alignment losses are intended to mitigate the resulting domain gap.
Revision: yes

Referee: [§4] §4 (closed-loop evaluation): the claim that the policy “matches or exceeds prior end-to-end methods” is presented without reported trial counts, variance, or statistical tests against the cited baselines; a single qualitative statement is insufficient to support the performance conclusion that justifies the entire pipeline.

Authors: We acknowledge that the current presentation of the closed-loop results lacks the quantitative detail needed to substantiate the performance claims. We will expand §4 to report the number of independent trials, standard deviations or confidence intervals, and appropriate statistical comparisons against the baselines.
Revision: yes
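The statistics promised in this response could be computed along these lines. This is a generic sketch, not the authors' evaluation code: it assumes one closed-loop score per scene or trial and uses a percentile bootstrap for the confidence interval of the mean. `summarize_scores` and its parameters are hypothetical names.

```python
import random
import statistics


def summarize_scores(scores, n_boot=2000, alpha=0.05, seed=0):
    """Return trial count, mean, sample std, and a bootstrap CI of the mean.

    scores: one closed-loop score per scene or trial (at least two values).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample the scores with replacement and record each resample's mean.
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return {
        "n": n,
        "mean": statistics.fmean(scores),
        "std": statistics.stdev(scores),
        "ci": (lower, upper),
    }
```

Running the same summary on each baseline's per-scene scores would put the "matches or exceeds" claim on a comparable footing. A paired test over shared scenes would be a natural next step.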

Circularity Check

0 steps flagged

No circularity: method is a standard self-play pretrain + latent alignment pipeline with empirical claims

The provided abstract and description outline a two-stage procedure: (1) self-play RL in a vectorized simulator to obtain a policy, (2) alignment of a vision backbone's latent space to that policy via action KL divergence plus a structural loss on paired (image, scene-state) frames. No equation in the text reduces a claimed result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The performance statement is an empirical claim on closed-loop Gaussian-splatting rollouts rather than a mathematical identity. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5774 in / 973 out tokens · 34373 ms · 2026-06-27T02:23:17.615749+00:00 · methodology

work page arXiv 2024

[12] [12]

Zhang, M

B. Zhang, M. Golchoubian, I. Gilitschenski, B. Ivanovic, and K. Chitta. Endpoint constrained trajectory optimization for driving foundation models. InICCV RealADSim Workshop, 2025

2025

[13] [13]

Karkus, M

P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, T. Tian, A. Naumann, G. Garcia- Cobo, S. Tan, A. Degirmenci, A. Popov, N. Smolyanskiy, U. Muller, B. Ivanovic, and M. Pavone. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[14] [14]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InProceedings of the Conference on Robot Learning (CoRL), 2017. 9

2017

[15] [15]

D. Chen, B. Zhou, V . Koltun, and P. Kr¨ahenb¨uhl. Learning by cheating. InProceedings of the Conference on Robot Learning (CoRL), 2019

2019

[16] [16]

Zhang, A

Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool. End-to-end urban driving by imitating a reinforcement learning coach. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021

[17] [17]

P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y . Qiao. Trajectory-guided control prediction for end- to-end autonomous driving: A simple yet strong baseline. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022

[18] [18]

Cusumano-Towner, D

M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wij- mans, T. Killian, S. Bowers, O. Sener, P. Kr¨ahenb¨uhl, and V . Koltun. Robust autonomy emerges from self-play.arXiv preprint arXiv:2502.03349, 2025

work page arXiv 2025

[19] [19]

Chang, A

W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. SPACeR: Self-play anchoring with centralized reference models. InProceedings of the International Conference on Learning Representations (ICLR), 2026

2026

[20] [20]

Seong, J.-K

H. Seong, J.-K. Lee, H. Myeong, Y . Shin, H.-M. Cho, D. H. Kim, P. Desai, and M. Surana. Post-training and test-time scaling of generative agent behavior models for interactive au- tonomous driving.arXiv preprint arXiv:2512.13262, 2025

work page arXiv 2025

[21] [21]

Y . Guo, D. Ye, S. Chen, A. Liu, and X. Liu. CorrectionPlanner: Self-correction planner with reinforcement learning in autonomous driving.arXiv preprint arXiv:2603.15771, 2026

work page arXiv 2026

[22] [22]

Konstantinidis, M

F. Konstantinidis, M. Sackmann, U. Hofmann, and C. Stiller. Toward efficient and robust behavior models for multi-agent driving simulation. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2026

2026

[23] [23]

Ahmadi, H

E. Ahmadi, H. Schofield, B. Khamidehi, F. Arasteh, J. Shan, L. Mou, K. Rezaee, and D. Bai. RLFTSim: Realistic and controllable multi-agent traffic simulation via reinforcement learning fine-tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[24] [24]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J´egou, P. Labatut, and P. Bojanowski. DINOv3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

W. Wu, X. Feng, Z. Gao, and Y . Kan. SMART: Scalable multi-agent real-time motion gen- eration via next-token prediction. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[26] [26]

G. Hess, C. Lindstr ¨om, M. Fatemi, C. Petersson, and L. Svensson. SplatAD: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[27] [27]

H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, H. Yin, X. Li, X. Zhang, Y . Zhang, W. Liu, Q. Zhang, and X. Wang. RAD: Training an end-to-end driving policy via large-scale 3DGS-based reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[28] [28]

J. Suarez. PufferLib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024

work page arXiv 2024

[29] [29]

Kaufmann, L

E. Kaufmann, L. Bauersfeld, A. Loquercio, M. M ¨uller, V . Koltun, and D. Scaramuzza. Champion-level drone racing using deep reinforcement learning.Nature, 620(7976):982–987, 2023. 10

2023

[30] [30]

Kumar, R

A. Kumar, R. Bahlous-Boldi, P. Sharma, P. Isola, S. Risi, Y . Tang, and D. Ha. Digital red queen: Adversarial program evolution in core war with LLMs.arXiv preprint arXiv:2601.03335, 2026

work page arXiv 2026

[31] [31]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning (ICML), 2021

2021

[32] [32]

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. V . Le, Y .-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text su- pervision. InProceedings of the International Conference on Machine Learning (ICML), 2021

2021

[33] [33]

J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning (ICML), 2023

2023

[34] [34]

A. A. Rusu, S. G. Colmenarejo, C ¸ . G ¨ulc ¸ehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V . Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. InProceedings of the Inter- national Conference on Learning Representations (ICLR), 2016

2016

[35] [35]

Parisotto, J

E. Parisotto, J. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforce- ment learning. InProceedings of the International Conference on Learning Representations (ICLR), 2016

2016

[36] [36]

Y . W. Teh, V . Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2017

2017

[37] [37]

Kickstarting Deep Reinforcement Learning

S. Schmitt, J. J. Hudson, A. ˇZ´ıdek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. K¨uttler, A. Zisserman, K. Simonyan, and S. M. A. Eslami. Kickstarting deep reinforcement learning.arXiv preprint arXiv:1803.03835, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[38] [38]

J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning quadrupedal locomo- tion over challenging terrain.Science Robotics, 5(47):eabc5986, 2020

2020

[39] [39]

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust per- ceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62):eabk2822, 2022

2022

[40] [40]

Kumar, Z

A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid motor adaptation for legged robots. In Robotics: Science and Systems (RSS), 2021

2021

[41] [41]

Loquercio, E

A. Loquercio, E. Kaufmann, R. Ranftl, M. M ¨uller, V . Koltun, and D. Scaramuzza. Learning high-speed flight in the wild.Science Robotics, 6(59):eabg5810, 2021

2021

[42] [42]

T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation. InPro- ceedings of the Conference on Robot Learning (CoRL), 2021

2021

[43] [43]

Z. Wu, R. Song, V . Mundheda, L. E. Navarro-Serment, C. Schoenborn, and J. Schneider. TADPO: Reinforcement learning goes off-road. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2026. URLhttps://arxiv.org/abs/ 2603.05995

work page arXiv 2026

[44] [44]

Hinton, O

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. InNeurIPS Deep Learning and Representation Learning Workshop, 2015

2015

[45] [45]

Philion, A

J. Philion, A. Kar, and S. Fidler. Learning to evaluate perception models using planner-centric metrics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), 2020. 11

2020

[46] [46]

Li and X

W.-X. Li and X. Yang. Transcendental idealism of planner: Evaluating perception from plan- ning perspective for autonomous driving. InProceedings of the International Conference on Machine Learning (ICML), 2023

2023

[47] [47]

Tung and G

F. Tung and G. Mori. Similarity-preserving knowledge distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1365–1374, 2019

2019

[48] [48]

Zaheer, S

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017

[49] [49]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[50] [50]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[51] [51]

Gavish and D

M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is4/ √ 3.IEEE Transactions on Information Theory, 60(8):5040–5053, 2014

2014

[52] [52]

Roy and M

O. Roy and M. Vetterli. The effective rank: A measure of effective dimensionality. In2007 15th European Signal Processing Conference, pages 606–610. IEEE, 2007. 12 Supplementary Contents A Self-Play Policy Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 B Closed-Loop HD...

work page arXiv 2007