
arXiv: 2606.17386 · v1 · pith:PONEVWXWnew · submitted 2026-06-16 · cs.CV · cs.AI · cs.RO

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Pith reviewed 2026-06-27 02:23 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.RO
keywords: end-to-end driving, self-play, policy transfer, autonomous driving, latent alignment, no expert demonstrations, 3D Gaussian splatting, reinforcement learning

The pith

Self-play in a vectorized simulator, followed by latent-space alignment to a pretrained vision backbone, yields end-to-end image-based driving policies without expert demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that end-to-end autonomous driving from images can be learned without expert trajectory data. It first pretrains a single policy through self-play in a fast vectorized simulator, which generates millions of rollout steps per second and a state distribution rich in collisions, near-misses, and recoveries. The policy is then transferred to a pretrained vision backbone by aligning latent representations, using an action KL divergence whose target comes from the self-play policy, plus a batch-relational low-rank structural loss on paired (image, scene-state) frames. Because alignment never supervises against a logged trajectory, curated expert demonstrations are not needed; the paired frames suffice. In closed-loop tests on photorealistic 3D Gaussian splatting scenarios, the resulting policy is reported to match or exceed prior end-to-end methods.

Core claim

By pretraining a policy exclusively through self-play in a vectorized simulator and then aligning its latent space to a vision backbone with action KL divergence and batch-relational low-rank structural loss on paired image and scene-state data, the approach produces an end-to-end image-based driving policy that requires no expert trajectory supervision and achieves performance matching or exceeding prior methods on photorealistic closed-loop scenarios.
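
One way to write the alignment objective named in the core claim, as a sketch rather than the paper's exact formulation. Every symbol here is an assumption introduced for illustration: pi_T is the self-play (teacher) policy acting on scene state s, pi_S is the image-based (student) policy acting on image x, D is the paired dataset, L_struct stands for the batch-relational low-rank structural loss, and lambda is a weighting coefficient. The direction of the KL term is also assumed.

```latex
\mathcal{L}_{\text{align}}
  = \mathbb{E}_{(x,\,s)\sim\mathcal{D}}
    \Big[ D_{\mathrm{KL}}\big( \pi_T(\cdot \mid s) \,\|\, \pi_S(\cdot \mid x) \big) \Big]
  + \lambda \, \mathcal{L}_{\text{struct}}
```

Note that no logged expert action appears anywhere in this expression: the only action target is the teacher's own output distribution, which is the property the core claim relies on.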

What carries the argument

The latent space alignment process that transfers self-play behavior to an image-based model using action KL divergence and batch-relational low-rank structural loss.
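
The action KL term can be sketched as follows. This is a minimal illustration, assuming a discrete action set and the direction KL(teacher || student); the abstract confirms neither, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) between two discrete action distributions.

    teacher_logits: logits from the self-play policy (fed the scene state).
    student_logits: logits from the image-based policy (fed the paired image).
    Both the discrete action set and the KL direction are assumptions.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share one action set")
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def batch_action_kl(teacher_batch, student_batch):
    """Mean action KL over a batch of paired (scene-state, image) frames."""
    if not teacher_batch or len(teacher_batch) != len(student_batch):
        raise ValueError("batches must be non-empty and the same length")
    total = sum(action_kl(t, s) for t, s in zip(teacher_batch, student_batch))
    return total / len(teacher_batch)
```

The loss is zero when the student reproduces the teacher's action distribution and grows as the two diverge. No logged driver action enters the computation.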

If this is right

  • End-to-end policies can be trained without the cost of collecting and labeling expert driving demonstrations.
  • Self-play supplies state distributions rich in collisions and recoveries absent from logged data.
  • Pretrained vision backbones can be used directly for image inputs after alignment.
  • Only paired image and scene-state frames are required for the transfer step.
  • Closed-loop performance on photorealistic renderings reaches or surpasses imitation-based baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling of policy learning from vision learning could apply to other simulation-heavy robotics tasks.
  • Similar alignment techniques might reduce expert data needs in other vision-based control domains.
  • Extending self-play to more complex vectorized environments could improve transfer quality.
  • The batch-relational loss may help preserve structural consistency across diverse driving situations.
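
To make "batch-relational" concrete, here is a sketch of a similarity-preserving distillation loss, a known technique that compares how samples in a batch relate to one another in the teacher's latent space versus the student's. This is a stand-in, not the paper's loss: the abstract names a batch-relational low-rank structural loss without defining it, and the low-rank component is not reproduced here. All function names are hypothetical.

```python
import math


def gram(latents):
    """Batch Gram matrix: pairwise dot products between per-sample latents."""
    return [[sum(a * b for a, b in zip(u, v)) for v in latents] for u in latents]


def row_normalize(matrix):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm > 0.0 else 0.0 for x in row])
    return out


def similarity_preserving_loss(teacher_latents, student_latents):
    """Mean squared difference between row-normalized batch Gram matrices.

    The two latent spaces may have different widths; only the b x b
    relational structure across the batch is compared.
    """
    b = len(teacher_latents)
    if b == 0 or b != len(student_latents):
        raise ValueError("batches must be non-empty and the same length")
    g_t = row_normalize(gram(teacher_latents))
    g_s = row_normalize(gram(student_latents))
    squared = sum(
        (x - y) ** 2
        for row_t, row_s in zip(g_t, g_s)
        for x, y in zip(row_t, row_s)
    )
    return squared / (b * b)
```

Because only the batch-by-batch similarity pattern is compared, a student whose latents are a rescaled or differently sized copy of the teacher's geometry incurs no penalty, while a student that collapses distinct scenes onto the same latent does.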

Load-bearing premise

That aligning the self-play policy's latent space to a vision backbone via action KL divergence and batch-relational low-rank structural loss will produce a functional image-based driving policy without expert trajectory supervision.

What would settle it

If the aligned image-based policy shows markedly worse closed-loop success rates than prior expert-supervised methods on the same photorealistic 3D Gaussian splatting scenarios, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2606.17386 by Akshay Rangesh, Chen Tang, Grantland Hall, Saarth Bonde, Weixin Li, Wei Zhan, Yihan Hu, Zhouchonghao Wu, Zikang Xiong.

Figure 1: Conventional vs. proposed training paradigm. (a) Conventional recipes begin with imitation pretraining on fleet-scale logs, then add supervised fine-tuning, open-loop RL on logged trajectories, or closed-loop image RL in sensor simulators, each path requiring expensive human-driving data or photorealistic rendering. (b) Our two-phase paradigm decouples learning to drive from learning to see: Phase 1 trains…
Figure 2: Two-phase training pipeline. Flame icons mark trainable modules and snowflakes mark frozen ones; red arrows carry gradients and blue arrows do not. Phase 1 (left): A single self-play vector policy is trained end-to-end with PPO in a multi-agent vectorized simulator. The ego, map, and partner encoders, and the action head are jointly optimized, and the same parameter set controls every agent in the scene so…
Figure 3: Alignment data efficiency. Closed-loop HD-Score vs. relative nuPlan training data ρ (our full alignment set = 1.83M frames ⇒ ρ = 1), for the All set (top) and each HUGSim tier (bottom) on nuScenes; bands are ±1 per-scene std. The self-play teacher (§4.2) uses no nuPlan data and sits at ρ = 0 (horizontal dashed line) in every panel; ECO Smoothing-only is placed at ρ ≈ 1.6 (∼2.8M nuPlan frames vs. our 1.83M…
Figure 5: Modal coverage: ensemble-averaged squared projection of the batch-derived top-…
Figure 6: Fraction of perturbation energy preserved in the projected subspace as a function of batch…
Figure 7: Front-camera samples from decoupled alignment data. A grid of CAM_F0 observations collected from HUGSim rollouts. These frames provide paired visual observations and reconstructed scene states for the alignment loss; their actions are not used as expert demonstrations, so rollout quality affects alignment only through the state coverage induced by the data-collection policy.
Figure 8: Cut-in negotiation. The ego initially travels at low speed while monitoring a vehicle that could cut in. Once it confirms the vehicle has no cut-in intention (center), the ego accelerates and overtakes smoothly (right). The speed profile shows a brief deceleration followed by a sustained ramp-up, demonstrating that the policy withholds commitment until the other agent’s intent is resolved.
Figure 9: Overtaking with oncoming traffic. A static vehicle blocks the lane while a slow oncoming car approaches. The policy decides to overtake (center), accelerates past the static vehicle, and merges back into the original lane before the oncoming car arrives (right). The speed profile rises throughout the maneuver, reflecting a committed overtake decision made under time pressure.
Figure 10: Lead vehicle yielding route. The policy follows a slow lead vehicle at a steady speed (left). When the lead vehicle diverges from the ego’s route (center), the ego turns onto its own path and accelerates (right). The speed profile is flat during the follow phase and then climbs sharply after the turn, showing the policy correctly disengages from the lead once it is no longer relevant.
Figure 11: Narrow-lane passage. In a narrow lane, the policy slows when approaching an oncoming vehicle (left), holds a reduced speed while passing (center), and then begins decelerating again in anticipation of a vulnerable road user detected ahead (right, red box). The speed profile oscillates rather than recovering fully, reflecting the policy’s forward-looking awareness of the downstream hazard.
Figure 12: Visually implausible extreme-tier scenarios. Extreme scenarios in the HUGSim benchmark can be visually unrealistic, with implausible initial configurations, or an occluded environment where other cars can pass through each other.
Original abstract

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents TerraTransfer, a method to learn end-to-end driving policies without expert demonstrations. A policy is first pretrained via self-play in a vectorized simulator; its latent space is then aligned to a pretrained vision backbone using action KL divergence together with a batch-relational low-rank structural loss on paired (image, scene-state) frames. The resulting image-based policy is evaluated in closed-loop on photorealistic 3D Gaussian splatting scenarios, where it is claimed to match or exceed prior end-to-end methods.

Significance. If the central claim holds, the work would materially lower the cost of training end-to-end driving policies by exploiting the speed and rich failure-state distribution of vectorized self-play while removing the need for curated expert trajectories. The explicit decoupling of policy learning from perception learning is a substantive conceptual contribution that could generalize beyond driving.

major comments (2)
  1. [Abstract / §3 (alignment)] The alignment procedure (described in the abstract and presumably §3) uses paired (image, scene-state) frames and action KL divergence, yet the manuscript does not state whether the scene-states are generated inside the vectorized simulator or inside the target Gaussian-splatting environment. If the former, the domain gap in dynamics, geometry, and physics is never directly penalized and remains load-bearing for the claim that the transferred policy functions in photorealistic closed-loop rollouts.
  2. [§4] §4 (closed-loop evaluation): the claim that the policy “matches or exceeds prior end-to-end methods” is presented without reported trial counts, variance, or statistical tests against the cited baselines; a single qualitative statement is insufficient to support the performance conclusion that justifies the entire pipeline.
minor comments (1)
  1. [Abstract / §3] The batch-relational low-rank structural loss is named but not defined or referenced; a short equation or citation would remove ambiguity.
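
The reporting requested in major comment 2 could take the following shape: a percentile bootstrap confidence interval over per-scene score differences between the proposed policy and a baseline. This is a sketch under stated assumptions, not the paper's evaluation protocol; the function names are hypothetical, scores are assumed to be paired by scene, and no real results are shown.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def paired_score_difference(ours, baseline, **kwargs):
    """Mean per-scene score difference (ours - baseline) with a bootstrap CI.

    `ours` and `baseline` hold one closed-loop score per scene, in the
    same scene order (an assumption about how results are organized).
    """
    if len(ours) != len(baseline):
        raise ValueError("scores must be paired scene by scene")
    diffs = [a - b for a, b in zip(ours, baseline)]
    mean_diff = sum(diffs) / len(diffs)
    lo, hi = bootstrap_mean_ci(diffs, **kwargs)
    return mean_diff, lo, hi
```

If the interval returned for the difference excludes zero, the "exceeds" half of the claim has some support; if it straddles zero, "matches" is the most the data can bear. Pairing by scene removes between-scene difficulty from the comparison.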

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to improve clarity and rigor.

  1. Referee: [Abstract / §3 (alignment)] The alignment procedure (described in the abstract and presumably §3) uses paired (image, scene-state) frames and action KL divergence, yet the manuscript does not state whether the scene-states are generated inside the vectorized simulator or inside the target Gaussian-splatting environment. If the former, the domain gap in dynamics, geometry, and physics is never directly penalized and remains load-bearing for the claim that the transferred policy functions in photorealistic closed-loop rollouts.

    Authors: We agree that the source of the scene-states must be stated explicitly. The paired (image, scene-state) frames are constructed by rendering images from the 3D Gaussian splatting environment while obtaining the corresponding scene-states from the vectorized simulator for the same underlying scene configuration. We will revise §3 to state this construction clearly and add a short discussion of how the alignment losses are intended to mitigate the resulting domain gap. revision: yes

  2. Referee: [§4] §4 (closed-loop evaluation): the claim that the policy “matches or exceeds prior end-to-end methods” is presented without reported trial counts, variance, or statistical tests against the cited baselines; a single qualitative statement is insufficient to support the performance conclusion that justifies the entire pipeline.

    Authors: We acknowledge that the current presentation of the closed-loop results lacks the quantitative detail needed to substantiate the performance claims. We will expand §4 to report the number of independent trials, standard deviations or confidence intervals, and appropriate statistical comparisons against the baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: method is a standard self-play pretrain + latent alignment pipeline with empirical claims

The provided abstract and description outline a two-stage procedure: (1) self-play RL in a vectorized simulator to obtain a policy, (2) alignment of a vision backbone's latent space to that policy via action KL divergence plus a structural loss on paired (image, scene-state) frames. No equation in the text reduces a claimed result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The performance statement is an empirical claim on closed-loop Gaussian-splatting rollouts rather than a mathematical identity. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.


