Human-like autonomy emerges from self-play and a pinch of human data
Pith reviewed 2026-06-27 06:59 UTC · model grok-4.3
The pith
Combining self-play with minimal human data produces driving policies that coordinate with people.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating a small collection of human demonstrations as a regularization objective atop a minimal safe goal-reaching reward, self-play reinforcement learning yields policies that coordinate with held-out human trajectories. This uses 2500 times less human data than comparable imitation learning while completing training in 15 hours on one consumer-grade GPU.
What carries the argument
Human demonstrations as a regularization objective on a minimal safe goal-reaching reward within self-play RL.
If this is right
- Policies coordinate with held-out human trajectories.
- Training requires only 30 minutes of human demonstrations.
- Training completes in 15 hours on a single consumer-grade GPU.
- The approach avoids extensive reward engineering and domain randomization.
- Resulting policies are compatible with human driving conventions.
Where Pith is reading between the lines
- This regularization approach could apply to other self-play scenarios where alignment with human preferences is needed.
- Reducing human data requirements might make training autonomous systems more accessible to smaller teams.
- Testing in real-world traffic could validate if the coordination holds beyond simulation.
Load-bearing premise
Using human demonstrations only as regularization on a minimal reward will reliably prevent alien driving conventions.
What would settle it
Observing that the trained policies still use driving conventions incompatible with held-out human trajectories would falsify the claim.
Figures
read the original abstract
Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes combining self-play RL for driving policies with a minimal safe goal-reaching reward and a small amount (30 minutes) of human demonstration data used as regularization. It claims the resulting policies coordinate with held-out human trajectories, require 2500x less human data than comparable imitation learning methods, and train in 15 hours on a single consumer GPU, avoiding brittle reward engineering.
Significance. If the empirical claims are substantiated with methods, baselines, and ablations, the work would show that minimal human data can regularize self-play to produce human-compatible driving without extensive reward shaping, offering a data-efficient path to aligned autonomous driving policies. The promised code release supports reproducibility.
major comments (2)
- [Abstract] Abstract: The central claim that 30 minutes of human data (2500x reduction) suffices to prevent alien conventions and enable coordination with held-out trajectories is load-bearing, yet the text provides no description of the regularization loss form, the exact base reward, the imitation learning baselines used for the 2500x comparison, or any ablation removing the human term. Without these, it is impossible to verify whether the regularization is truly minimal or functions as soft imitation.
- [Abstract] The manuscript states quantitative results on coordination and training time but supplies no methods section details, metrics for 'coordination with held-out human trajectories,' or experimental setup (e.g., simulation environment, policy architecture). This prevents assessment of whether the minimal reward truly contains no additional shaping as asserted.
Simulated Author's Rebuttal
Thank you for the detailed review. The comments highlight important omissions in the presentation of our method. We will revise the abstract and ensure all methodological details are clearly stated to allow verification of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 30 minutes of human data (2500x reduction) suffices to prevent alien conventions and enable coordination with held-out trajectories is load-bearing, yet the text provides no description of the regularization loss form, the exact base reward, the imitation learning baselines used for the 2500x comparison, or any ablation removing the human term. Without these, it is impossible to verify whether the regularization is truly minimal or functions as soft imitation.
Authors: We acknowledge that the abstract is too concise and omits these critical details. The regularization is implemented as a small-weighted imitation loss (cross-entropy on actions from the 30-minute human dataset) added to the policy gradient objective. The base reward is strictly goal distance plus a binary collision penalty, with no other terms. The 2500x comparison uses standard imitation learning on the full nuScenes training set (approximately 1250 hours). An ablation without the human term is provided in the supplementary material, showing increased coordination failure. We will expand the abstract to include brief descriptions of these components. revision: yes
-
Referee: [Abstract] The manuscript states quantitative results on coordination and training time but supplies no methods section details, metrics for 'coordination with held-out human trajectories,' or experimental setup (e.g., simulation environment, policy architecture). This prevents assessment of whether the minimal reward truly contains no additional shaping as asserted.
Authors: The full manuscript includes a methods section (Section 3) detailing the CARLA simulator, the 3-layer CNN + LSTM policy architecture, and the exact coordination metric (mean min distance to held-out human paths, thresholded at 1.5 meters for success). Training is on one NVIDIA RTX 4090 GPU for 15 hours. The reward has no additional shaping. However, to address the concern about the abstract, we will include a short methods summary there as well. revision: partial
Circularity Check
No circularity: empirical method validated on held-out data
full rationale
The paper presents a practical RL training procedure that combines self-play with a small human-demonstration regularization term on top of a minimal goal-reaching reward. No equations, fitted parameters renamed as predictions, self-citation chains, or ansatzes are described that would make any claimed result equivalent to its inputs by construction. Performance is assessed against held-out human trajectories, supplying independent empirical evidence rather than a self-referential derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Silver, A
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V . Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search.Nature, 529(7587):484–489, 2016
2016
-
[2]
arXiv:2406.02316 [cs.CR] (2024)https://doi.org/10.48550/arXiv
S. Sokota, E. Vinitsky, H. Hu, J. Z. Kolter, and G. Farina. Superhuman AI for stratego using self-play reinforcement learning and test-time search.CoRR, abs/2511.07312, 2025. doi: 10.48550/ARXIV .2511.07312. URLhttps://doi.org/10.48550/arXiv.2511.07312
work page internal anchor Pith review doi:10.48550/arxiv 2025
- [3]
-
[4]
S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps.arXiv preprint arXiv:2408.01584, 2024
arXiv 2024
-
[5]
M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener, P. Krähenbühl, and V . Koltun. Robust autonomy emerges from self-play.arXiv preprint arXiv:2502.03349, 2025
arXiv 2025
-
[6]
D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025
arXiv 2025
- [7]
-
[8]
J. Z. Leibo, E. Hughes, M. Lanctot, and T. Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research.arXiv preprint arXiv:1903.00742, 2019
Pith/arXiv arXiv 1903
-
[9]
Bakhtin, D
A. Bakhtin, D. Wu, A. Lerer, and N. Brown. No-press diplomacy from scratch.Advances in Neural Information Processing Systems, 34:18063–18074, 2021
2021
-
[10]
W. Wu, X. Feng, Z. Gao, and Y . Kan. Smart: Scalable multi-agent real-time motion generation via next-token prediction.Advances in Neural Information Processing Systems, 37:114048– 114071, 2024
2024
-
[11]
J. Qiu, A. Saviolo, C. Wang, M. Wang, and X. Huang. Heterogeneous self-play for realistic highway traffic simulation. 2026. URLhttps://arxiv.org/abs/2604.16406
Pith/arXiv arXiv 2026
-
[12]
W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone. Reward (mis) design for autonomous driving.Artificial Intelligence, 316:103829, 2023
2023
-
[13]
D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1, 1988
1988
-
[14]
M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016
Pith/arXiv arXiv 2016
-
[15]
J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Traffic modeling as next-token prediction. arXiv preprint arXiv:2312.04535, 2023
arXiv 2023
-
[16]
M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe, R. Wang, V . Kallem, S. Casas, R. Al-Rfou, B. Sapp, and D. Anguelov. Scaling laws of motion forecasting and planning: A technical report.arXiv preprint arXiv:2506.08228, 2025. 11
arXiv 2025
-
[17]
J. Suarez. PufferLib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024
arXiv 2024
-
[18]
Cornelisse, S
D. Cornelisse, S. Cheng, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URLhttps://github.com/Emerge-Lab/PufferDrive
2025
-
[19]
H. Hu, D. J. Wu, A. Lerer, J. Foerster, and N. Brown. Human-ai coordination via human- regularized search and learning.arXiv preprint arXiv:2210.05125, 2022
arXiv 2022
-
[20]
A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of no-press Diplomacy via human-regularized reinforcement learning and planning. InInternational Conference on Learning Representations, 2023. arXiv:2210.05492
arXiv 2023
-
[21]
D. Cornelisse and E. Vinitsky. Human-compatible driving partners through data-regularized self-play reinforcement learning. InReinforcement Learning Journal, 2024. arXiv:2403.19648
arXiv 2024
-
[22]
Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. Whiteson. Learning to drive in new cities without human demonstrations.arXiv preprint arXiv:2602.15891, 2026
arXiv 2026
-
[23]
Ettinger, S
S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021
2021
-
[24]
A. Wan, E. Wallace, S. Shen, and D. Klein. Poisoning language models during instruction tuning. InInternational Conference on Machine Learning, pages 35413–35425. PMLR, 2023
2023
-
[25]
Zhang, J
Y . Zhang, J. Rando, I. Evtimov, J. Chi, E. M. Smith, N. Carlini, F. Tramèr, and D. Ippolito. Per- sistent pre-training poisoning of llms. InInternational Conference on Learning Representations, volume 2025, pages 31323–31340, 2025
2025
- [26]
-
[27]
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
Pith/arXiv arXiv 2017
-
[28]
Montali, J
N. Montali, J. Lambert, P. Mougin, A. Kuefler, N. Rhinehart, M. Li, C. Gulino, T. Emrich, Z. Yang, S. Whiteson, et al. The waymo open sim agents challenge.Advances in Neural Information Processing Systems, 36:59151–59171, 2023
2023
-
[29]
Waymo safety impact
Waymo LLC. Waymo safety impact. https://waymo.com/safety/impact/, 2025. Ac- cessed: 2026-05-06
2025
-
[30]
L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 (12):10164–10183, 2024
2024
-
[31]
X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024
2024
-
[32]
Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning- oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023. 12
2023
-
[33]
Jiang, S
B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023
2023
-
[34]
Huang, J
Y . Huang, J. Du, Z. Yang, Z. Zhou, L. Zhang, and H. Chen. A survey on trajectory-prediction methods for autonomous driving.IEEE transactions on intelligent vehicles, 7(3):652–674, 2022
2022
-
[35]
Vinitsky, N
E. Vinitsky, N. Lichtlé, S. Kanaa, et al. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[36]
Caesar, V
H. Caesar, V . Bankiti, A. Lang, S. V ora, V . Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. arxiv. 2019
2019
-
[37]
B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hart- nett, J. K. Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023
Pith/arXiv arXiv 2023
-
[38]
M. Bansal, A. Krizhevsky, and A. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst.arXiv preprint arXiv:1812.03079, 2018
Pith/arXiv arXiv 2018
-
[39]
Salzmann, B
T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 683–700. Springer, 2020
2020
-
[40]
J. Gu, C. Sun, and H. Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15303–15312, 2021
2021
-
[41]
Nayakanti, R
N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2980–2987. IEEE, 2023
2023
- [42]
-
[43]
Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu. Hivt: Hierarchical vector transformer for multi- agent motion prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8823–8833, 2022
2022
-
[44]
S. Shi, L. Jiang, D. Dai, and B. Schiele. Motion transformer with global intention localization and local movement refinement.Advances in Neural Information Processing Systems, 35: 6531–6543, 2022
2022
-
[45]
Z. Zhou, J. Wang, Y .-H. Li, and Y .-K. Huang. Query-centric trajectory prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17863–17873, 2023
2023
-
[46]
A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp. Motionlm: Multi-agent motion forecasting as language modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023
2023
-
[47]
Zhong, D
Z. Zhong, D. Rempe, D. Xu, Y . Chen, S. Veer, T. Che, B. Ray, and M. Pavone. Guided conditional diffusion for controllable traffic simulation. In2023 IEEE international conference on robotics and automation (ICRA), pages 3560–3566. IEEE, 2023. 13
2023
-
[48]
C. M. Jiang, Y . Bai, A. Cornman, C. Davis, X. Huang, H. Jeon, S. Kulshrestha, J. Lambert, S. Li, X. Zhou, et al. Scenediffuser: Efficient and controllable driving simulation initialization and rollout.Advances in Neural Information Processing Systems, 37:55729–55760, 2024
2024
- [49]
-
[50]
S. Tan, J. Lambert, H. Jeon, S. Kulshrestha, Y . Bai, J. Luo, D. Anguelov, M. Tan, and C. M. Jiang. Scenediffuser++: City-scale traffic simulation via a generative world model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1570–1580, 2025
2025
-
[51]
B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025
2025
-
[52]
Y . Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7553–7560. IEEE, 2023
2023
-
[53]
Z. Peng, W. Luo, Y . Lu, T. Shen, C. Gulino, A. Seff, and J. Fu. Improving agent behaviors with RL fine-tuning for autonomous driving. InComputer Vision - ECCV 2024 - 18th European Conference, volume 15083 ofLecture Notes in Computer Science, pages 165–181. Springer, 2024
2024
-
[54]
Zhang, P
Z. Zhang, P. Karkus, M. Igl, W. Ding, Y . Chen, B. Ivanovic, and M. Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[55]
Silver, T
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018
2018
-
[56]
Vinyals, I
O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning.nature, 575(7782):350–354, 2019
2019
-
[57]
other-play
H. Hu, A. Lerer, A. Peysakhovich, and J. Foerster. “other-play” for zero-shot coordination. In International conference on machine learning, pages 4399–4410. PMLR, 2020
2020
-
[58]
N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V . Dumoulin, S. Moitra, E. Hughes, et al. The hanabi challenge: A new frontier for ai research.Artificial Intelligence, 280:103216, 2020
2020
-
[59]
A. P. Jacob, D. J. Wu, G. Farina, A. Lerer, H. Hu, A. Bakhtin, J. Andreas, and N. Brown. Modeling strong and human-like gameplay with KL-regularized search. InInternational Conference on Machine Learning, pages 9695–9728. PMLR, 2022
2022
-
[60]
Congested traffic states in empirical observations and microscopic simulations , volume=
M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical Review E, 62(2):1805–1824, Aug. 2000. ISSN 1063-651X, 1095-3787. doi:10.1103/PhysRevE.62.1805. URL https://link.aps.org/doi/10.1103/ PhysRevE.62.1805
work page internal anchor Pith review doi:10.1103/physreve.62.1805 2000
-
[61]
Charraut, W
V . Charraut, W. Doulazmi, T. Tournaire, and T. Buhet. V-Max: A RL framework for autonomous driving.Reinforcement Learning Journal, 6:2427–2451, 2025
2025
-
[62]
Dauner, M
D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024. 14
2024
-
[63]
Cornelisse
D. Cornelisse. Human-likeness metrics for autonomous agents: are we measuring the right thing? Substack, 2025. Blog post analyzing the Waymo Open Sim Agent Challenge (WOSAC) realism benchmark
2025
-
[64]
D. Arumugam, S. Kumar, R. Gummadi, and B. Van Roy. Satisficing exploration for deep reinforcement learning.arXiv preprint arXiv:2407.12185, 2024
arXiv 2024
-
[65]
J. M. Scanlon, K. D. Kusano, J. Engstrom, and T. Victor. Collision avoidance effectiveness of an automated driving system using a human driver behavior reference model in reconstructed fatal collisions. InWCX SAE World Congress Experience. SAE Technical Paper, 2026
2026
- [66]
-
[67]
A. Distelzweig, F. Janjoš, A. Look, A. Rothenhäusler, D. Jost, O. Scheel, R. Rajan, D. Cor- nelisse, E. Vinitsky, and J. Boedecker. Beyond self-play and scale: A behavior benchmark for generalization in autonomous driving.arXiv preprint arXiv:2605.10034, 2026
Pith/arXiv arXiv 2026
-
[68]
Gulino, J
C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.Advances in Neural Information Processing Systems, 36:7730–7742, 2023. 15 A Simulation Environment and Design A.1 World Initialization from Scenario Metadata We use ...
2023
-
[69]
This caps acceleration and braking
Longitudinal acceleration bound.The change in implied forward speed is clipped to ±Along,max ·∆t, whereA long,max = 8m/s 2. This caps acceleration and braking. 17 Figure 6: Discretized delta-local action space for each component (∆x, ∆y, ∆ψ). Histograms show the empirical density (blue) of 10,996,751 valid action timesteps recovered from expert trajectori...
-
[70]
This eliminates lateral sliding and side-shimmy at low forward speed
Lateral motion envelope.Lateral displacement is bounded by |∆y| ≤ |∆x| ·tan(δ max), where δmax = 0.7 rad is the maximum effective steering angle. This eliminates lateral sliding and side-shimmy at low forward speed. These physical constraints prevent kinematically implausible actions; they do not encode any prefer- ence over driving style and are independ...
2000
-
[71]
Curriculum learning based on advantage.Each scenario can be treated as a level whose difficulty is measured by the agent’s average advantage. Upsampling scenarios proportionally to their advantage would concentrate training signal on cases the agent finds difficult, naturally increasing exposure to rare but safety-critical situations such as sudden cut-in...
-
[72]
Domain randomization.Masking out the observation of a ratio of agents within each scenario ("blind" agents [5]) and adding noise to the dynamics or partner features provides a targeted form of domain randomization that could make policy behavior more cautious
-
[73]
Adversarial fine-tuning.A third training stage that fine-tunes on a curated set of adversarial human data would expose the policy to scenarios where the other agents in the scene do not respond to it
-
[74]
Human-like opponents.Occasionally replacing the self-play opponent with the BC anchor rather than a copy of the RL policy would expose the agent to more human-like partner behavior throughout training
-
[75]
Table 10: Interactive evaluation across all scaling checkpoints
Stronger anchor policy.The BC anchor is itself a limiting factor: our best anchor achieves a closed-loop score of 0.66 (Table 7), and a stronger anchor, whether through architectural improvements or additional data, would give the KL regularizer a more reliable behavioral target. Table 10: Interactive evaluation across all scaling checkpoints. All metrics...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.