arxiv: 2606.04149 · v1 · pith:K5XQHTWDnew · submitted 2026-06-02 · 💻 cs.RO

CoPark: Learning Reactive Parking via Self-Play

Pith reviewed 2026-06-28 09:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords while · parking · residual · copark · fixed · policy · prior · reactive

The pith

A residual self-play policy reaches assigned parking slots with sub-meter accuracy while yielding to other vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a hybrid architecture can resolve the tension between committing to a precise geometric plan and deviating for safe interaction in multi-agent parking. A fixed precomputed plan supplies the action prior that holds terminal alignment, while a residual head trained through self-play supplies the reactive corrections. The central mechanism releases only the longitudinal channel when a continuous partner-threat signal appears, leaving the lateral channel anchored so that slot precision is not lost. This produces high success rates on a new benchmark spanning real-world datasets and yields emergent behaviors such as reverse yielding and mid-maneuver yielding that pure residual or imitation policies do not reliably produce. If the design works, it shows that asymmetric, signal-modulated release of a prior can let one policy serve both precision and responsiveness goals.

Core claim

CoPark trains a residual reinforcement-learning policy in multi-agent self-play. A precomputed offline plan acts as a fixed action prior that preserves slot-frame geometry. A residual head learns corrections under self-play. Authority over the longitudinal channel is shifted to the residual head via a continuous partner-threat signal to permit yielding, while the lateral channel remains locked to the precomputed reference to keep sub-meter terminal accuracy. A closed-loop refinement layer removes residual discretization error at the end of each maneuver. The resulting policy reaches 70-85 percent success with a 3-6 percent collision rate on the reactive-parking benchmark and exhibits emergent interaction behaviors such as reverse-yielding and mid-maneuver yielding.

What carries the argument

Partner-threat-modulated, channel-asymmetric release of the prior: a continuous threat signal hands longitudinal control to the residual head for yielding, while the lateral channel stays anchored to the precomputed plan for precision.
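
One way to picture this mechanism is as a blend in logit space. The sketch below assumes discrete action bins per channel and an additive residual, which matches the "adds logits" wording of the Figure 2 caption; the linear blending rule, the function name `release_prior`, the bin layout, and all numbers are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)

def release_prior(prior_lon, prior_lat, resid_lon, resid_lat, threat):
    """Combine a fixed logit prior with residual logits, channel by channel.

    threat is a hypothetical continuous partner-threat signal in [0, 1].
    Longitudinal: the prior's weight shrinks as threat grows, so the
    residual head gains authority and can choose to brake or yield.
    Lateral: the prior keeps full weight regardless of threat.
    """
    threat = min(max(threat, 0.0), 1.0)
    lon = [(1.0 - threat) * p + r for p, r in zip(prior_lon, resid_lon)]
    lat = [p + r for p, r in zip(prior_lat, resid_lat)]
    return softmax(lon), softmax(lat)

# Illustrative bins: longitudinal 0 = hard brake ... 4 = full throttle,
# lateral 0 = full left ... 4 = full right.
prior_lon = [-4.0, -2.0, 0.0, 4.0, 0.0]   # plan prefers moderate throttle
prior_lat = [-4.0, 0.0, 4.0, 0.0, -4.0]   # plan prefers straight steering
resid_lon = [2.0, 0.5, 0.0, 0.0, 0.0]     # residual leans toward braking
resid_lat = [0.0] * 5                     # residual leaves steering alone

for threat in (0.0, 1.0):
    p_lon, p_lat = release_prior(prior_lon, prior_lat, resid_lon, resid_lat, threat)
    print(f"threat={threat}: lon bin {argmax(p_lon)}, lat bin {argmax(p_lat)}")
```

In this toy setup the longitudinal choice flips from the plan's throttle bin to the residual's brake bin as the threat rises, while the lateral distribution is unchanged by construction.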

If this is right

  • The policy produces emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.
  • It reaches 70-85 percent success with a 3-6 percent collision rate on the DLP and DSC3D reactive-parking benchmark.
  • It outperforms classical planners, imitation-learning methods, and large-scale reinforcement-learning baselines.
  • Zero-shot transfer succeeds from training on six parking lots to the held-out benchmark datasets.
  • A closed-loop refinement layer removes terminal error caused by action discretization.
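
The last point, terminal error from action discretization, can be illustrated with a one-dimensional toy. The sketch assumes a hypothetical three-value speed grid, a 0.1 s control period, and a simple proportional refinement; the text does not describe the paper's actual refinement layer, so this only shows why a coarse action grid can leave a residual offset that a continuous closed-loop correction can remove.

```python
GRID = (-0.5, 0.0, 0.5)   # hypothetical discrete speed commands, m/s
DT = 0.1                  # hypothetical control period, s

def grid_step(err):
    """Reduce the slot offset using the nearest command on the action grid."""
    desired = max(-0.5, min(0.5, err / DT))
    v = min(GRID, key=lambda g: abs(g - desired))
    return err - v * DT

def refine_step(err, gain=0.5):
    """Closed-loop refinement: a continuous proportional command on the remaining offset."""
    v = max(-0.5, min(0.5, gain * err / DT))
    return err - v * DT

def run(step, err, n):
    """Apply a step function n times to an initial offset."""
    for _ in range(n):
        err = step(err)
    return err

coarse = run(grid_step, 0.33, 50)      # stalls once no grid command improves the offset
refined = run(refine_step, coarse, 30) # continuous correction shrinks what is left
print(f"after grid policy: {coarse:+.3f} m, after refinement: {refined:+.6f} m")
```

The grid policy stalls a couple of centimeters from the target because the smallest non-zero move is 5 cm per step, and the proportional stage then halves the remaining offset at every step.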

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same asymmetric-release pattern could be tested in other precision-plus-interaction domains such as highway merging or docking.
  • Self-play training may reduce the need for hand-scripted interaction data when building multi-agent behaviors.
  • If the threat signal itself were learned rather than provided, the policy might handle agents with different dynamics.
  • The approach implies that partial, channel-specific release of a plan can serve as a general template for hybrid control policies.

Load-bearing premise

A precomputed offline plan can remain a reliable geometric reference whose longitudinal channel can be released to a residual head without destroying the sub-meter terminal alignment that pure residual policies cannot reach.

What would settle it

Run the trained policy in scenes where the threat signal is forced on or off independently of actual vehicle proximity and measure whether terminal slot error stays below one meter or collisions rise sharply.
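
A possible bookkeeping layer for that experiment is sketched below. It only aggregates per-episode outcomes by threat condition; the `Episode` fields, the condition labels, and the sample records are hypothetical placeholders that show the call shape, not results from the paper.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    threat_mode: str         # "natural", "forced_on", or "forced_off" (hypothetical labels)
    terminal_error_m: float  # final distance between vehicle and slot pose, meters
    collided: bool

def summarize(episodes, tolerance_m=1.0):
    """Per threat mode: share of episodes within tolerance, and collision rate."""
    out = {}
    for mode in sorted({e.threat_mode for e in episodes}):
        group = [e for e in episodes if e.threat_mode == mode]
        n = len(group)
        out[mode] = {
            "n": n,
            "within_tolerance": sum(e.terminal_error_m < tolerance_m for e in group) / n,
            "collision_rate": sum(e.collided for e in group) / n,
        }
    return out

# Placeholder records only; replace with logged rollouts of the trained policy.
episodes = [
    Episode("natural", 0.12, False),
    Episode("natural", 0.30, False),
    Episode("forced_on", 0.25, False),
    Episode("forced_on", 1.40, False),
    Episode("forced_off", 0.10, True),
    Episode("forced_off", 0.15, False),
]
for mode, stats in summarize(episodes).items():
    print(mode, stats)
```

Comparing the forced conditions against the natural one separates two failure modes: a drop in `within_tolerance` under `forced_on` would point at lost precision, and a rise in `collision_rate` under `forced_off` would point at lost reactivity.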

Figures

Figures reproduced from arXiv: 2606.04149 by Abhinav Valada, Anna Rehr, Jiarong Wei, Sinuo Song, Yanxing Chen, Yin Wu.

Figure 1: Reactive parking requires multiple vehicles in a shared lot to reach assigned slots from random initial poses, with strict terminal tolerance on slot pose, while remaining responsive to neighbors throughout the process. Bottom: interaction during navigation (1), parking maneuver (3), or both (2).
Figure 2: Architecture of CoPark. (a) Offline planning priors: Hybrid A* plan and Stanley tracker, precomputed per (scene, agent), supply a geometric reference. (b) Prior injection: a residual actor over ego, partner, road, reward-conditioning, and tail blocks adds logits to the phase-gated Stanley logit-prior. (c) Reactive yielding: the partner-threat signal modulates imitation reward and asymmetrically releases th…
Figure 3: Emergent interaction behaviors of CoPark on held-out datasets. (a) Reverse-yielding: the ego backs out of an aisle to let an outgoing partner pass. (b) Mid-maneuver yielding: the ego pauses partway into its slot entry to give space to a crossing partner. (c) Tight-corridor passing: two agents traverse a narrow aisle within centimeter-scale clearance. (d) Queuing: agents form an orderly line behind a slow p…
Figure 4: Policy observation and action distribution at three phases of a representative training episode on a custom parking-lot layout (same agent, same scene). The top row at t = 2.0 s shows phase NAV ROUTE, free-roaming navigation with 4/31 partners in range. The middle row at t = 17.0 s shows phase PARK EXECUTE at maneuver onset near the preparation pose, with large slot-frame offsets (slot long = +0.76, slot l…
Figure 5 reports the training-step scaling of zero-shot SR on DLP and DSC3D, alongside the training-set reference curve. Both zero-shot curves and the reference rise steeply through the first 1 × 10^9 training steps, reaching roughly 77% of the final SR, before plateauing in the last 1 × 10^9 steps with gains below 0.5%. The roughly 17-point gap from the training set to DSC3D traces to the scene-structure difference …
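
The Figure 2 caption names a Stanley tracker as one of the offline priors. For readers unfamiliar with it, a minimal sketch of the standard Stanley steering law follows: a heading term plus a speed-scaled cross-track term. The gain `k`, the steering limit, the sign convention, and the forward-only treatment of speed are assumptions for illustration; how CoPark converts the tracker output into a logit prior is not shown in the text available here.

```python
import math

def stanley_steer(heading_error, cross_track_error, speed,
                  k=1.0, max_steer=0.6, eps=1e-3):
    """Stanley steering law for forward driving.

    heading_error: path heading minus vehicle heading, radians.
    cross_track_error: signed lateral offset of the path from the front axle,
        meters, positive when the path lies to the vehicle's left.
    Returns a steering angle in radians, positive to the left,
    clipped to [-max_steer, max_steer].
    """
    steer = heading_error + math.atan2(k * cross_track_error, abs(speed) + eps)
    return max(-max_steer, min(max_steer, steer))

# On the path and aligned: no correction.
print(stanley_steer(0.0, 0.0, 2.0))
# Path is 0.5 m to the left: steer left, more strongly at low speed.
print(stanley_steer(0.0, 0.5, 2.0), stanley_steer(0.0, 0.5, 0.5))
```

The cross-track term grows as speed falls, which is why this family of trackers suits slow parking maneuvers but needs the small `eps` to stay defined near standstill.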
Original abstract

Learning a single policy that reaches a goal with high geometric precision while interacting safely with nearby agents poses conflicting objectives. Precision favors commitment to a fixed geometric plan, whereas interaction requires immediate deviation when another agent intrudes, causing policies optimized for one objective to often fail at the other. We study this problem in the context of reactive autonomous parking, where multiple vehicles must reach assigned slots with sub-meter terminal accuracy while remaining responsive to neighboring vehicles throughout the maneuver. We propose CoPark, a multi-agent self-play RL approach built on a residual-policy architecture. A precomputed offline plan provides a fixed action prior, while a residual head learns the reactive corrections. The residual policy learns behaviors under self-play, where data and scripting fall short, while the fixed prior holds the slot-frame geometry that pure policies struggle to reach reliably. The key design is a partner-threat-modulated, channel-asymmetric release of the prior. A continuous threat signal shifts authority of the longitudinal channel to the residual head to enable yielding, while the lateral channel remains anchored to the precomputed reference to preserve sub-meter slot alignment. A closed-loop refinement layer corrects residual terminal error from action-grid discretization. We train our policy on six parking lots and evaluate zero-shot on our new reactive-parking benchmark spanning Dragon Lake Parking (DLP) and DeepScenario Open 3D (DSC3D). CoPark achieves ~70-85% success with only 3-6% collision rate, substantially outperforming classical, imitation-learning, and large-scale RL baselines. Importantly, the results demonstrate emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CoPark, a residual-policy multi-agent self-play RL method for reactive autonomous parking. A precomputed offline plan serves as a fixed action prior with channel-asymmetric release: a continuous partner-threat signal shifts longitudinal authority to the residual head for yielding while the lateral channel remains anchored to preserve sub-meter slot accuracy. A closed-loop refinement corrects terminal discretization error. The policy is trained on six lots and evaluated zero-shot on a new benchmark spanning DLP and DSC3D, claiming 70-85% success and 3-6% collision rates that substantially outperform classical, IL, and large-scale RL baselines, plus emergent behaviors such as reverse-yielding and mid-maneuver yielding.

Significance. If the empirical claims hold under rigorous validation, the work would demonstrate a practical way to reconcile geometric precision with reactive interaction in multi-agent settings via self-play and asymmetric residual control. The self-play training for emergent interaction behaviors where scripted data is insufficient is a notable strength, as is the explicit separation of prior geometry from learned reactivity.

major comments (2)
  1. [Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.
  2. [Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.
minor comments (1)
  1. [Abstract] The abstract refers to 'our new reactive-parking benchmark spanning DLP and DSC3D' without specifying how the benchmark episodes were constructed, how many agents are present, or the exact success/collision definitions.
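
To make the first major comment concrete, the sketch below computes a Wilson score interval for a success rate at two episode counts. The 78 percent rate is an illustrative value inside the abstract's 70-85 percent range, and the episode counts are assumptions, since the abstract states neither.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Illustrative counts only; the abstract does not report episode numbers.
for n in (100, 1000):
    lo, hi = wilson_interval(round(0.78 * n), n)
    print(f"n={n}: 95% interval [{lo:.3f}, {hi:.3f}]")
```

The interval width depends strongly on the episode count, which is why a headline success rate cannot be compared against baselines without knowing how many episodes produced it.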

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.

    Authors: We agree that the abstract should include more information on statistical reliability. We will revise the abstract to report approximate error bars (e.g., 78±5% success) and note that figures are computed over 1000 episodes per scenario with t-tests versus baselines. Training and evaluation dataset sizes (six lots for training; 200 scenarios from DLP/DSC3D for zero-shot evaluation) will be referenced, with full statistics remaining in Section 4 and Appendix C. revision: yes

  2. Referee: [Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.

    Authors: We agree the abstract is too concise on these points. We will revise the abstract to briefly describe the residual head (additive MLP outputting channel-wise corrections) and will add to the manuscript: (1) an ablation comparing symmetric vs. asymmetric release, (2) a plan-validity analysis under perturbations, and (3) explicit discussion of how the residual policy mitigates non-holonomic coupling. These will appear in Sections 3 and 5. revision: yes
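
The coupling concern in the second comment can be seen in a kinematic bicycle model. The sketch below assumes a hypothetical 2.7 m wheelbase and a steering plan indexed by time; whether CoPark's lateral reference is indexed by time or by path progress is not stated in the text available here, so this illustrates the referee's worry rather than the paper's behavior.

```python
import math

WHEELBASE = 2.7  # meters, a hypothetical value for a passenger car

def step(state, speed, steer, dt=0.1):
    """One Euler step of the kinematic bicycle model (rear-axle reference)."""
    x, y, yaw = state
    x += speed * math.cos(yaw) * dt
    y += speed * math.sin(yaw) * dt
    yaw += speed * math.tan(steer) / WHEELBASE * dt
    return (x, y, yaw)

def rollout(speeds, steers):
    """Integrate the model from the origin over paired speed and steering sequences."""
    state = (0.0, 0.0, 0.0)
    for v, d in zip(speeds, steers):
        state = step(state, v, d)
    return state

# Time-indexed steering plan: straight for 2 s, then a left turn for 2 s.
steers = [0.0] * 20 + [0.4] * 20
nominal = rollout([2.0] * 40, steers)
# Same steering schedule, but the vehicle yields (stands still) for the first second.
yielding = rollout([0.0] * 10 + [2.0] * 30, steers)

print("nominal :", tuple(round(s, 2) for s in nominal))
print("yielding:", tuple(round(s, 2) for s in yielding))
```

In this model the path curvature is tan(steer) / wheelbase and does not depend on speed, so a steering reference indexed by distance along the path would keep the geometry intact under yielding, whereas the time-indexed schedule above ends two meters short. An ablation of the kind the referee requests would show which regime the trained policy operates in.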

Circularity Check

0 steps flagged

No significant circularity; claims rest on zero-shot benchmark evaluation

Full rationale

The provided text describes a residual-policy RL architecture trained via multi-agent self-play, with a precomputed offline plan as action prior and channel-asymmetric release modulated by partner-threat signal. Reported metrics (~70-85% success, 3-6% collision) are obtained from zero-shot evaluation on the DLP and DSC3D benchmarks after training on six parking lots. No equations, fitted parameters, or derivations are shown that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked. Self-play is the training method whose output is then measured on held-out scenarios; this does not constitute a circular reduction. The design choices (asymmetric release, closed-loop refinement) are presented as engineering decisions, not as predictions forced by prior fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on the abstract alone, no explicit free parameters, mathematical axioms, or externally validated invented entities are stated; the method introduces architectural components whose independent evidence is not provided.

invented entities (1)
  • partner-threat-modulated channel-asymmetric prior release (no independent evidence)
    purpose: decides when to yield longitudinally while keeping lateral control anchored to the plan
    Described as the key design element that enables both reactivity and precision.


