arxiv: 2605.20255 · v1 · pith:RV3UIBPVnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.HC · cs.RO

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.HC · cs.RO
keywords multi-agent reinforcement learning · autonomous driving · pedestrian behavior modeling · jaywalking simulation · safety testing · MAPPO · trajectory metrics

The pith

Jointly training a self-driving car and pedestrians with multi-agent reinforcement learning produces more realistic crossing scenarios, and more measurable behavior gaps, than training the vehicle against fixed pedestrian policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that co-training an autonomous vehicle and a group of pedestrians in a shared reinforcement learning environment captures the hidden uncertainties in human crossing decisions better than training the vehicle against static pedestrian rules. Pedestrians follow fixed paths but learn go-or-wait choices influenced by an unobserved personality trait that raises jaywalking odds, while the vehicle must respond without seeing that trait. This setup matters for safety testing because simulations that rely on scripted pedestrians tend to underrepresent jaywalking, which is associated with most collisions in the paper's evaluations. Results show that the jointly trained vehicle reaches more goals with fewer crashes and travels measurably faster near jaywalkers than near crosswalk users at close range. The approach also shows that this gap can be read directly from recorded trajectories without additional sensors.

Core claim

An SDC and 12 pedestrians are co-trained with MAPPO in an environment where pedestrian locomotion uses Dijkstra pathfinding and an RL policy governs go/wait decisions modulated by a per-pedestrian hidden personality trait. In 500-episode tests the co-trained SDC reaches 78 percent of goals with a 14 percent collision rate, outperforming the best rule-based baseline of 35 percent goals and 33 percent collisions. A speed differential metric reveals the SDC moves 2.65 m/s faster near jaywalkers than near crosswalk users at 0-3 m range, while jaywalking events comprise only 13 percent of crossings yet account for 62 percent of collisions. Co-training reduces collisions by 30 percent relative to single-agent RL.
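The two jaywalking shares above imply a per-event risk ratio, which the following sketch works out. It assumes that every collision is attributed to exactly one crossing event and that both percentages are taken over the same set of events; the function name is illustrative and does not come from the paper.

```python
def per_event_collision_ratio(share_of_events: float, share_of_collisions: float) -> float:
    """Ratio of per-event collision rates: jaywalking vs. all other crossings.

    With N crossing events and C collisions in total, the jaywalking rate is
    (share_of_collisions * C) / (share_of_events * N), and the rate for the
    remaining crossings is
    ((1 - share_of_collisions) * C) / ((1 - share_of_events) * N).
    N and C cancel when the two rates are divided.
    """
    if not 0.0 < share_of_events < 1.0:
        raise ValueError("share_of_events must lie strictly between 0 and 1")
    if not 0.0 <= share_of_collisions < 1.0:
        raise ValueError("share_of_collisions must lie in [0, 1)")
    jaywalk_rate = share_of_collisions / share_of_events
    other_rate = (1.0 - share_of_collisions) / (1.0 - share_of_events)
    return jaywalk_rate / other_rate


# Shares reported in the paper: 13% of crossing events, 62% of collisions.
ratio = per_event_collision_ratio(0.13, 0.62)
print(f"per-event collision ratio: {ratio:.1f}")
```

Under those assumptions a jaywalking event is roughly eleven times as likely to coincide with a collision as a crosswalk event, which is a sharper way of stating the 13 percent versus 62 percent contrast.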

What carries the argument

MAPPO-based multi-agent co-training in which pedestrians learn go/wait policies conditioned on a hidden personality trait while the SDC must infer behavior from observed motion alone.
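A minimal sketch of the hidden-trait idea follows, under loudly stated assumptions: the paper's abstract does not give the trait distribution or the mapping from trait to jaywalking probability, so the uniform sampling, the linear mapping, and every name here (`Pedestrian`, `jaywalk_probability`, `sdc_observation`, `base`, `gain`) are hypothetical. The only properties taken from the source are that the trait is sampled per pedestrian at episode start, that it raises jaywalking odds, and that the SDC never observes it.

```python
import random
from dataclasses import dataclass


@dataclass
class Pedestrian:
    x: float
    y: float
    speed: float
    trait: float  # hidden personality trait; never exposed to the SDC


def sample_pedestrians(n: int, rng: random.Random) -> list[Pedestrian]:
    """Sample one trait per pedestrian at episode start (uniform is an assumption)."""
    return [
        Pedestrian(x=rng.uniform(0.0, 50.0), y=rng.uniform(0.0, 50.0),
                   speed=0.0, trait=rng.random())
        for _ in range(n)
    ]


def jaywalk_probability(trait: float, base: float = 0.05, gain: float = 0.4) -> float:
    """Hypothetical linear mapping from trait to jaywalking probability, clamped to [0, 1]."""
    return min(1.0, max(0.0, base + gain * trait))


def sdc_observation(pedestrians: list[Pedestrian]) -> list[tuple[float, float, float]]:
    """What the SDC sees: position and speed only, with the trait left out."""
    return [(p.x, p.y, p.speed) for p in pedestrians]


rng = random.Random(0)
peds = sample_pedestrians(12, rng)
obs = sdc_observation(peds)
```

The point of the split is that `trait` influences pedestrian behavior but is absent from `obs`, so the vehicle can only infer it from motion it observes over time.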

If this is right

  • Jaywalking accounts for a small fraction of crossings yet drives the majority of collisions, highlighting the need for anticipation models that treat personality-driven deviations separately.
  • The co-trained SDC shows higher speeds near jaywalkers at close range, indicating that interaction learning still leaves measurable anticipation gaps.
  • Collision rates drop 30 percent relative to single-agent RL when pedestrians also adapt during training, suggesting that mutual policy learning improves overall safety metrics.
  • Trajectory-derived speed differentials provide a direct, sensor-free way to quantify the predictability gap between crossing types.
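The trajectory-derived speed differential in the last bullet can be sketched as a difference of mean SDC speeds inside one distance bin. This is an illustration rather than the authors' implementation: the sample layout `(distance_m, sdc_speed_mps, crossing_type)`, the half-open bin `[0, 3)`, and the toy values are all assumptions made for the example.

```python
from statistics import mean

Sample = tuple[float, float, str]  # (distance to nearest pedestrian in m, SDC speed in m/s, crossing type)


def speed_differential(samples: list[Sample], d_min: float = 0.0, d_max: float = 3.0) -> float:
    """Mean SDC speed near jaywalkers minus mean SDC speed near crosswalk users.

    Only samples whose distance falls in [d_min, d_max) are used.
    """
    jaywalk = [s for d, s, kind in samples if d_min <= d < d_max and kind == "jaywalk"]
    crosswalk = [s for d, s, kind in samples if d_min <= d < d_max and kind == "crosswalk"]
    if not jaywalk or not crosswalk:
        raise ValueError("need samples of both crossing types inside the distance bin")
    return mean(jaywalk) - mean(crosswalk)


# Synthetic toy samples, not data from the paper.
toy = [
    (1.0, 6.0, "jaywalk"),
    (2.5, 5.0, "jaywalk"),
    (1.5, 3.0, "crosswalk"),
    (2.0, 2.0, "crosswalk"),
    (8.0, 9.0, "jaywalk"),  # outside the 0-3 m bin, ignored
]
diff = speed_differential(toy)
```

A positive value means the vehicle is still moving faster when it is close to a jaywalker than when it is close to a crosswalk user, which is how the paper reads its reported 2.65 m/s gap.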

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hidden-trait mechanism could be extended to model other unobservable factors such as distraction or group influence in future multi-agent driving simulators.
  • Trajectory-based metrics might transfer to post-hoc analysis of real-world fleet data to flag high-risk interaction patterns without new instrumentation.
  • Scaling the same co-training loop to include cyclists or other road users could reveal similar behavior gaps in more complex mixed-traffic scenes.

Load-bearing premise

The combination of scripted Dijkstra paths, RL-controlled go/wait decisions, and one hidden personality trait per pedestrian is enough to represent real human crossing uncertainty and heterogeneity.

What would settle it

A direct comparison of speed profiles and collision rates in the simulated environment against video recordings of actual urban crossings under comparable visibility and speed conditions.

Figures

Figures reproduced from arXiv: 2605.20255 by Kaushik Raghupathruni, Prakash Aryan, Sebastiano Panichella, Timo Kehrer.

Figure 1: Two outcomes in our environment. The blue rectangle
Figure 2: System architecture. (a) CTDE: the centralized critic uses global state during training and is discarded at execution.
Figure 3: Goal and collision rates across methods (500 episodes
Figure 4: SDC speed vs. distance to nearest pedestrian, separated
Original abstract

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. This paper describes a MARL environment in which an SDC and 12 pedestrians are co-trained using Multi-Agent Proximal Policy Optimization (MAPPO). Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high-level go/wait decisions. Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule-based baseline. A speed differential metric shows that the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating that jaywalking encounters were not anticipated. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions. Co-training with MARL pedestrians reduced collisions by 30% relative to single-agent RL, as pedestrians learned to wait when the SDC approached at speed.
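For readers unfamiliar with MAPPO, the standard PPO clipped surrogate that it applies per agent is reproduced below as background. The notation is an assumption chosen for this note ($o_t^{i}$ for agent $i$'s local observation, $s_t$ for the global state, $\epsilon$ for the clip range); the paper's own hyperparameters and any additional loss terms are not given in the abstract.

```latex
r_t^{i}(\theta) = \frac{\pi_\theta\left(a_t^{i} \mid o_t^{i}\right)}{\pi_{\theta_{\text{old}}}\left(a_t^{i} \mid o_t^{i}\right)},
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_{i,t}\left[
  \min\left(
    r_t^{i}(\theta)\,\hat{A}_t^{i},\;
    \operatorname{clip}\left(r_t^{i}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t^{i}
  \right)
\right]
```

The advantage estimate $\hat{A}_t^{i}$ is computed from a centralized critic $V_\phi(s_t)$ that sees the global state during training, while each policy acts on its local observation only. That is the centralized-training, decentralized-execution arrangement described in the Figure 2 caption.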

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a MARL environment using MAPPO to co-train an SDC and 12 pedestrians, where pedestrian locomotion uses scripted Dijkstra paths and RL controls go/wait decisions modulated by a hidden per-pedestrian personality trait that governs jaywalking probability. The central claim is that this co-training yields more realistic interaction scenarios than fixed pedestrian policies, with the behavior gap measurable from trajectories; 500-episode results report the co-trained SDC reaching 78% goals at 14% collision rate versus 35% goals and 33% collisions for the best rule-based baseline, plus a 2.65 m/s speed differential near jaywalkers and jaywalking (13% of events) linked to 62% of collisions.

Significance. If the performance gains hold under equivalent training budgets and the pedestrian model is shown to better approximate real human heterogeneity, the work could advance simulation-based safety validation for autonomous driving by incorporating latent behavioral uncertainty. The numerical differences are clear, but without real-data calibration or statistical rigor the significance for practical deployment remains provisional.

major comments (3)
  1. Abstract: the claim that co-training 'produces more realistic interaction scenarios' rests on the unvalidated assumption that the Dijkstra+RL+hidden-trait pedestrian model captures human crossing heterogeneity; no calibration or statistical comparison to real pedestrian trajectories is provided, so the metrics only demonstrate SDC behavioral differences rather than improved fidelity.
  2. Abstract (evaluation paragraph): the reported 78% vs 35% goal-reaching and 14% vs 33% collision rates do not state whether the rule-based baseline received an equivalent training budget or number of environment steps, leaving the superiority claim only partially supported.
  3. Abstract: no variance, confidence intervals, or statistical significance tests accompany the 500-episode aggregate metrics (e.g., the 2.65 m/s speed differential or 62% collision attribution), so it is unclear whether the observed gaps are robust or sensitive to random seeds.
minor comments (2)
  1. Abstract: the description of the personality trait sampling and its effect on jaywalking probability could be expanded with the exact functional form or probability mapping used.
  2. The manuscript should clarify the observation space available to the SDC (e.g., whether personality traits or go/wait intentions are fully hidden) to support reproducibility.
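As an illustration of the interval estimates requested in major comment 3, the sketch below computes Wilson score intervals for the reported collision rates. It assumes that each of the 500 evaluation episodes contributes one binary collision outcome, so the counts 70 and 165 are reconstructed from the 14% and 33% figures rather than taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Counts reconstructed from the reported percentages over 500 episodes (an assumption).
co_trained = wilson_interval(70, 500)    # 14% collision rate
rule_based = wilson_interval(165, 500)   # 33% collision rate
print(f"co-trained collision rate: {co_trained[0]:.3f} to {co_trained[1]:.3f}")
print(f"rule-based collision rate: {rule_based[0]:.3f} to {rule_based[1]:.3f}")
```

Intervals of this kind capture only the sampling noise across evaluation episodes. They say nothing about sensitivity to the training seed, which is the part of the objection that a single trained policy cannot address.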

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments in detail below, indicating where we agree and the revisions we will implement to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the claim that co-training 'produces more realistic interaction scenarios' rests on the unvalidated assumption that the Dijkstra+RL+hidden-trait pedestrian model captures human crossing heterogeneity; no calibration or statistical comparison to real pedestrian trajectories is provided, so the metrics only demonstrate SDC behavioral differences rather than improved fidelity.

    Authors: We agree that the pedestrian model is not calibrated against real data, and thus the assertion of 'more realistic' is an assumption based on the inclusion of hidden personality traits that modulate jaywalking. The manuscript presents this as a hypothesis, with results showing that co-training leads to different SDC behaviors and reduced collisions. To address the concern, we will revise the abstract to replace 'produces more realistic interaction scenarios' with 'leads to interaction scenarios with greater behavioral diversity' and add a sentence noting the lack of real-world calibration as a limitation. This change will be reflected in the revised manuscript. revision: yes

  2. Referee: Abstract (evaluation paragraph): the reported 78% vs 35% goal-reaching and 14% vs 33% collision rates do not state whether the rule-based baseline received an equivalent training budget or number of environment steps, leaving the superiority claim only partially supported.

    Authors: The rule-based baselines are fixed, non-adaptive policies and therefore do not receive training or environment steps in the same manner as the RL agents. The training is performed only for the MAPPO policies in the co-training setup and the single-agent RL comparison. All methods are evaluated over the same 500 episodes. We will revise the abstract and methods to explicitly state that the baselines are rule-based fixed policies without training, ensuring the comparison is clearly between learned policies against different pedestrian behaviors. revision: yes

  3. Referee: Abstract: no variance, confidence intervals, or statistical significance tests accompany the 500-episode aggregate metrics (e.g., the 2.65 m/s speed differential or 62% collision attribution), so it is unclear whether the observed gaps are robust or sensitive to random seeds.

    Authors: We concur that variability measures are important for assessing robustness. The reported figures are averages over 500 episodes that incorporate stochasticity from personality trait sampling and environment initialization. In the revision, we will add standard deviations to the key metrics in the abstract and results section. We will also clarify that the training used a single random seed for reproducibility, and note that sensitivity to seeds is a potential limitation that could be explored with additional computational resources. revision: partial

standing simulated objections not resolved
  • Providing a full statistical comparison and calibration of the simulated pedestrian behaviors to real human trajectory data, as this would necessitate new experiments with external datasets beyond the scope of the current simulation study.

Circularity Check

0 steps flagged

No significant circularity; results are direct simulation rollouts independent of model inputs

Full rationale

The paper defines a MARL setup (MAPPO co-training of SDC and 12 pedestrians, Dijkstra locomotion plus RL go/wait decisions, hidden per-pedestrian personality trait controlling jaywalking probability) as an input modeling choice. It then reports independent empirical outcomes from 500-episode rollouts: 78% goal success and 14% collisions for the co-trained agent versus 35% goals and 33% collisions for the best rule-based baseline, plus derived trajectory metrics such as 2.65 m/s speed differential near jaywalkers and 62% of collisions from 13% jaywalking events. These quantities are computed directly from executed trajectories and do not reduce algebraically or definitionally to the personality parameters or training procedure. No self-citations, uniqueness theorems, or fitted-input renamings are invoked to support the central performance claims; the comparison to single-agent RL and rule-based baselines supplies external reference points within the simulation. The derivation chain is therefore self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen pedestrian model (scripted paths plus RL go/wait plus hidden trait) is a faithful proxy for real behavioral uncertainty; no free parameters are explicitly fitted to external data in the reported results.

free parameters (1)
  • per-pedestrian personality trait
    Sampled once per episode to set jaywalking probability; value distribution not specified in abstract.
axioms (1)
  • domain assumption: Pedestrian locomotion follows scripted Dijkstra pathfinding while RL controls only high-level go/wait decisions
    Stated directly in the environment description.

