Heterogeneous Self-Play for Realistic Highway Traffic Simulation
Pith reviewed 2026-05-08 02:15 UTC · model gemini-3-flash-preview
The pith
Synthetic self-play enables realistic highway traffic simulation without relying on real-world expert logs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present PHASE, a self-play framework that achieves zero-shot transfer to real-world highway data with a 96.3% success rate despite being trained entirely on synthetic scenarios. The central discovery is that heterogeneous agents—ranging from passenger cars to articulated trucks—can learn stable, human-like interaction dynamics through context-conditioned reinforcement learning. This method significantly outperforms traditional models in behavioral realism, reducing trajectory errors and improving the diversity of generated maneuvers by explicitly modeling the physical constraints and goals of different vehicle types in a closed-loop environment.
What carries the argument
PHASE (Policy for Heterogeneous Agent Self-play on Expressway), a context-aware reinforcement learning framework that uses vehicle-specific dynamics and goal-conditioned policies to stabilize multi-agent interactions.
If this is right
- Autonomous vehicle testing can move from static log-replays to dynamic interactions with diverse synthetic drivers.
- Safety-critical edge cases can be generated on demand by adjusting the goal and behavior conditions of the agents.
- The need for massive, curated real-world driving datasets for traffic simulation may be significantly reduced.
- Simulation can now accurately model the specific constraints of heavy vehicles, like trucks, alongside passenger cars in a single unified policy.
Where Pith is reading between the lines
- The success of the method suggests that the physics of highway interaction—spacing, merging, and speed differentials—is more fundamental to realism than the stylistic nuances of individual human drivers.
- This framework could likely be used to predict the impact of new traffic laws or infrastructure changes by observing how the self-playing agents adapt their strategies in the synthetic environment.
Load-bearing premise
The hand-tuned reward functions and safety rules are assumed to perfectly define the boundaries of human driving without biasing the agents against rare but realistic maneuvers.
What would settle it
A direct comparison of the agents' gap-acceptance and lane-change frequency against high-resolution real-world traffic data would reveal whether the synthetic agents have developed a distinct machine driving style that differs from human norms.
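The comparison proposed above could be run as a simple two-sample test on, say, gap-acceptance times. A minimal pure-Python sketch (the test statistic and function name are illustrative choices, not taken from the paper):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs. Could compare, e.g., gap-acceptance times of
    synthetic agents against those measured in human driving logs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(s, v):
        # fraction of sample s that is <= v
        return bisect.bisect_right(s, v) / len(s)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))
```

A statistic near 0 would suggest the synthetic distribution matches human norms; a large value would flag a distinct machine driving style.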
Original abstract
Realistic highway simulation is critical for scalable safety evaluation of autonomous vehicles, particularly for interactions that are too rare to study from logged data alone. Yet highway traffic generation remains challenging because it requires broad coverage across speeds and maneuvers, controllable generation of rare safety-critical scenarios, and behavioral credibility in multi-agent interactions. We present PHASE, Policy for Heterogeneous Agent Self-play on Expressway, a context-aware self-play framework that addresses these three requirements through explicit per-agent conditioning for controllability, synthetic scenario generation for broad highway coverage, and closed-loop multi-agent training for realistic interaction dynamics. PHASE further supports different vehicle profiles, for example, passenger cars and articulated trailer trucks, within a single policy via vehicle-aware dynamics and context-conditioned actions, and stabilizes self-play with early termination of unrecoverable states, at-fault collision attribution, highway-aware reward shaping, coupled curricula, and robust policy optimization. Despite being trained only on synthetic data, PHASE transfers zero-shot to 512 unseen high-interaction real scenarios in exiD, achieving a 96.3% success rate and reducing ADE/FDE from 6.57/12.07 m to 2.44/5.25 m relative to a prior self-play baseline. In a learned trajectory embedding space, it also improves behavioral realism over IDM, reducing Frechet trajectory distance by 13.1% and energy distance by 20.2%. These results show that synthetic self-play can provide a scalable route to controllable and realistic highway scenario generation without direct imitation of expert logs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces PHASE (Policy for Heterogeneous Agent Self-play on Expressway), a framework designed to generate realistic and controllable highway traffic simulations using multi-agent reinforcement learning. The authors address the challenge of creating high-fidelity interactions without relying solely on logged expert data, which often lacks coverage for rare safety-critical events. PHASE utilizes a context-aware policy that conditions on agent-specific goals and vehicle dynamics (e.g., articulated trucks), trained via self-play on synthetically generated highway layouts. Key stability mechanisms include at-fault collision attribution, early termination for unrecoverable states, and a curriculum for scenario complexity. The framework is evaluated on the real-world exiD dataset in a zero-shot manner, demonstrating significant improvements in trajectory accuracy (ADE/FDE) and distributional realism (Frechet and Energy distances) compared to existing self-play and heuristic (IDM) baselines.
Significance. The paper makes a significant contribution to the field of autonomous vehicle simulation by demonstrating that a policy trained entirely on synthetic data can transfer successfully to real-world highway scenarios. The inclusion of articulated trailer trucks within a unified multi-agent RL framework is technically impressive and addresses a common gap in existing simulators. Furthermore, the use of energy distance and learned trajectory embeddings for evaluation provides a more robust measure of behavioral realism than simple distance-to-log metrics. The high success rate (96.3%) in zero-shot transfer to the exiD dataset suggests that the framework effectively captures the fundamental physical and interaction constraints of highway driving.
major comments (3)
- [§3.2.1] The 'at-fault collision attribution' mechanism, while effective for stabilizing training by preventing the 'lazy agent' problem, introduces a potentially significant bias in behavioral realism. By ensuring the 'victim agent is not penalized' and continues its simulation (as stated in Section 3.2.1), the framework creates agents that are incentivized to be indifferent to collisions they did not initiate. In real-world driving, realism includes defensive maneuvering to avoid the errors of others. The manuscript should provide an analysis or discussion of whether this leads to 'blind' behavior where agents fail to react to impending accidents caused by nearby reckless actors, and how this impacts the 'behavioral credibility' claim in §1.
- [§4.2, Table 2] The improvement in ADE/FDE (reducing from 6.57/12.07 m to 2.44/5.25 m) is reported as a primary result for zero-shot transfer. However, because PHASE is a closed-loop simulator and exiD is a fixed log, there is a risk that these metrics reward agents for staying close to the original log even when interactions deviate. If a PHASE agent makes a valid but different maneuver than the human in the log, the ADE will penalize it, yet the simulation might be more 'realistic' than one that adheres to the log but ignores the current interactive state. The authors should clarify how they handle the divergence between the simulated state and the log-recorded background traffic over long horizons (e.g., 8 seconds).
- [§3.2.2] The early termination of 'unrecoverable states' is defined by thresholds like deviation from the lane center. While this aids convergence, it may prune the 'rare safety-critical scenarios' that the paper aims to study (§1). If agents are terminated the moment they enter a high-slip or high-deviation state, the simulation may lack coverage of the very edge cases (near-misses, recovery maneuvers) needed for AV safety evaluation. The authors should specify these thresholds and justify that they do not exclude valid extreme behaviors found in real traffic.
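For reference, the ADE/FDE metrics discussed in the comments above are simple pointwise displacement statistics. A minimal sketch, assuming the trajectories are time-aligned, equal-length lists of (x, y) points in metres:

```python
import math

def ade_fde(pred, ref):
    """Average Displacement Error (mean pointwise distance) and Final
    Displacement Error (distance at the last step) between a simulated
    trajectory and a logged one."""
    assert len(pred) == len(ref) and pred, "trajectories must align"
    dists = [math.hypot(px - rx, py - ry)
             for (px, py), (rx, ry) in zip(pred, ref)]
    return sum(dists) / len(dists), dists[-1]
```

This makes the referee's point concrete: a low ADE only means the rollout stayed near the log; it cannot distinguish a valid alternative maneuver from an unrealistic one.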
minor comments (3)
- [Figure 3] The visualization of the 'Articulated Trailer Truck' dynamics would benefit from a clearer depiction of the hinge constraint in the state representation. It is currently difficult to see how the trailer's orientation is used by the policy relative to the tractor.
- [§3.1, Eq. (2)] The notation for the goal-reaching reward refers to delta values that are not explicitly defined in the surrounding text; defining them would improve readability.
- [§4.2] The term 'zero-shot transfer' is used. It would be helpful to explicitly confirm if any fine-tuning or map-specific alignment was performed for the exiD layouts, or if the agent only receives local bird's-eye-view (BEV) features that are purely geometry-agnostic.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the significance of our work in zero-shot transfer and heterogeneous multi-agent simulation. The referee's comments regarding collision attribution, evaluation metrics, and early termination touch on the core challenges of realism in closed-loop simulators. In this response, we clarify how our training objectives implicitly maintain defensive behavior, explain our use of distributional metrics to supplement log-based distance measures, and provide specific details regarding the early termination thresholds to demonstrate that we do not prune meaningful safety-critical scenarios. We have updated the manuscript to include these clarifications and a more detailed discussion on the trade-offs of the at-fault collision mechanism.
Point-by-point responses
-
Referee: [§3.2.1] The 'at-fault collision attribution' mechanism... introduces a potentially significant bias in behavioral realism... the framework creates agents that are incentivized to be indifferent to collisions they did not initiate. In real-world driving, realism includes defensive maneuvering to avoid the errors of others.
Authors: The referee correctly identifies a potential pitfall in fault attribution. However, 'not penalizing' the victim does not result in indifference because any collision (regardless of fault) results in early termination of the episode. Since agents are rewarded for maintaining goal speed and staying in their lane over time, termination is a strictly negative outcome because it prevents the agent from accumulating future rewards. Consequently, agents maintain a strong implicit incentive for defensive driving to avoid termination. Our qualitative analysis (Section 4.3) confirms that agents frequently perform evasive maneuvers to avoid reckless actors. We have revised Section 3.2.1 to clarify that termination serves as a 'soft' penalty that preserves defensive incentives while the at-fault 'hard' penalty specifically targets the aggressive agent to stabilize the self-play equilibrium. revision: yes
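The termination-as-soft-penalty argument can be made concrete. A minimal sketch of a per-agent step outcome; the function name and penalty value are illustrative, not taken from the paper:

```python
def step_outcome(collided, at_fault, progress_reward, collision_penalty=-10.0):
    """Any collision ends the episode for the agent, forfeiting all future
    progress reward (the implicit 'soft' penalty), but only the at-fault
    agent receives the explicit 'hard' penalty."""
    reward = progress_reward
    if collided and at_fault:
        reward += collision_penalty
    done = collided  # the victim also terminates, so it still pays in lost return
    return reward, done
```

Because expected return shrinks whenever `done` arrives early, even the un-penalized victim retains an incentive to drive defensively.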
-
Referee: [§4.2, Table 2] The improvement in ADE/FDE... is reported as a primary result... However, because PHASE is a closed-loop simulator and exiD is a fixed log, there is a risk that these metrics reward agents for staying close to the original log even when interactions deviate... The authors should clarify how they handle the divergence.
Authors: We agree that ADE/FDE are imperfect for closed-loop evaluation because they do not account for valid multi-modal divergence from the log. We included them primarily for comparison with existing benchmarks like GPIL. To address the referee's concern, we treat the distributional metrics (Frechet Trajectory Distance and Energy Distance in Table 3) as our primary indicators of behavioral realism. These metrics evaluate the similarity between the distribution of generated behaviors and the distribution of human behaviors in a learned latent space, which is invariant to the specific spatial divergence from a single log sequence. We have revised Section 4.2 to explicitly acknowledge that ADE/FDE are 'proxy' metrics and that our claims of realism are primarily supported by the distributional alignment shown in Section 4.4. revision: yes
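The energy distance invoked here has a standard closed form over finite samples. A minimal sketch on lists of embedding vectors (the V-statistic form, illustrative only; PHASE computes it in a learned latent space not reproduced here):

```python
import math

def _mean_pairwise(xs, ys):
    # mean Euclidean distance over all cross pairs
    return sum(math.dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Energy distance between two samples of trajectory embeddings:
    2 E||X - Y|| - E||X - X'|| - E||Y - Y'||. Zero when the two
    distributions coincide (in the population limit)."""
    return (2 * _mean_pairwise(xs, ys)
            - _mean_pairwise(xs, xs)
            - _mean_pairwise(ys, ys))
```

Unlike ADE/FDE, this compares whole distributions of behavior, so a valid maneuver that diverges from one particular log is not penalized.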
-
Referee: [§3.2.2] The early termination of 'unrecoverable states' is defined by thresholds like deviation from the lane center. While this aids convergence, it may prune the 'rare safety-critical scenarios'... The authors should specify these thresholds and justify that they do not exclude valid extreme behaviors.
Authors: The thresholds for early termination are chosen to prune only 'non-physical' or 'out-of-domain' states that cannot be recovered by the policy, such as a lateral lane deviation > 4.0m (effectively leaving the highway) or a heading error relative to the lane > 90 degrees (driving backwards). These events represent failures of the simulation stability rather than 'near-miss' scenarios. High-slip or high-acceleration states that remain within the drivable bounds are NOT terminated, allowing the model to explore and learn recovery maneuvers from extreme but physically valid states. We have added a dedicated paragraph in Section 3.2.2 specifying these numerical thresholds and clarifying that they are designed to prune simulator artifacts while preserving valid safety-critical interactions. revision: yes
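The thresholds quoted in this response reduce to a simple predicate. A minimal sketch using the constants as stated in the rebuttal (the function name is illustrative):

```python
import math

MAX_LATERAL_DEV_M = 4.0            # beyond this the agent has left the highway
MAX_HEADING_ERR_RAD = math.pi / 2  # > 90 deg to the lane: driving backwards

def is_unrecoverable(lateral_dev_m, heading_err_rad):
    """Early termination prunes only out-of-domain states; high-slip or
    high-acceleration states inside these bounds are kept so the policy
    can learn recovery maneuvers."""
    return (abs(lateral_dev_m) > MAX_LATERAL_DEV_M
            or abs(heading_err_rad) > MAX_HEADING_ERR_RAD)
```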
Circularity Check
Zero-shot transfer from synthetic self-play to real-world logs provides independent validation
full rationale
The paper demonstrates a robust methodology that avoids circularity by training its 'PHASE' model entirely on synthetic data and evaluating it against an independent, real-world dataset (exiD). The 'success' and 'realism' of the generated trajectories are measured using external benchmarks—such as Average Displacement Error (ADE), Final Displacement Error (FDE), and Frechet trajectory distance—calculated relative to human driving logs. This setup ensures that the performance gains are not a result of fitting to the evaluation distribution. While the paper introduces heuristics like 'at-fault collision attribution' to stabilize Multi-Agent Reinforcement Learning (MARL), these are treated as architectural design choices rather than definitions that guarantee the outcome. The evaluation criteria (no collisions of any type) remain stricter than the training reward (don't cause collisions), meaning the agents must learn defensive behaviors to achieve the reported 96.3% success rate on real scenarios. The reliance on the authors' prior work (MP-PPO) is limited to using it as a baseline for comparison, and the central claims of the paper are supported by empirical results on data the model never saw during training.
Axiom & Free-Parameter Ledger
free parameters (2)
- Reward Weights
- Early Termination Thresholds
axioms (2)
- domain assumption Kinematic Bicycle Model
- standard math Multi-Agent Markov Decision Process (MA-MDP)
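The kinematic bicycle model listed as a domain assumption admits a compact Euler update. A minimal sketch; the wheelbase and timestep are illustrative values, not taken from the paper:

```python
import math

def bicycle_step(x, y, yaw, v, accel, steer, wheelbase=2.9, dt=0.1):
    """One forward-Euler step of the kinematic bicycle model with a
    rear-axle reference point: steering sets the yaw rate v/L * tan(delta)."""
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    v += accel * dt
    return x, y, yaw, v
```

An articulated trailer would add a hinge-angle state driven by the tractor's yaw rate, which is presumably what the vehicle-aware dynamics in PHASE extend.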
invented entities (1)
- PHASE Policy (independent evidence)
Reference graph
Works this paper leans on
-
[1]
Optuna: A next-generation hyperparameter optimization framework, 2019
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework, 2019.
2019
-
[2]
Data-driven traffic simulation: A comprehensive review, 2023
Di Chen, Meixin Zhu, Hao Yang, Xuesong Wang, and Yinhai Wang. Data-driven traffic simulation: A comprehensive review, 2023.
2023
-
[3]
Building reliable sim driving agents by scaling self-play, 2025
Daphne Cornelisse, Aarav Pandya, Kevin Joseph, Joseph Suárez, and Eugene Vinitsky. Building reliable sim driving agents by scaling self-play, 2025.
2025
-
[4]
Robust autonomy emerges from self-play
Marco Cusumano-Towner, David Hafner, Alexander Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Kraehenbuehl, and Vladlen Koltun. Robust autonomy emerges from self-play. In Proceedings of the 42nd International Conference on Machine Learning, pages 11710–11737. PMLR, 2025.
2025
-
[5]
Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset
Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset, 2021.
2021
-
[6]
Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research
Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, John D. Co-Reyes, Rishabh Agarwal, Rebecca Roelofs, Yao Lu, Nico Montali, Paul Mougin, Zoey Yang, Brandyn White, Aleksandra Faust, Rowan McAllister, Dragomir Anguelov, and Benjamin Sapp. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research, 2023.
2023
-
[7]
Lasil: Learner-aware supervised imitation learning for long-term microscopic traffic simulation, 2024
Ke Guo, Zhenwei Miao, Wei Jing, Weiwei Liu, Weizi Li, Dayang Hao, and Jia Pan. Lasil: Learner-aware supervised imitation learning for long-term microscopic traffic simulation, 2024.
2024
-
[8]
Dynamic programming for partially observable stochastic games
Eric A. Hansen, Daniel S. Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI), pages 709–715. AAAI Press.
-
[9]
Versatile behavior diffusion for generalized traffic agent simulation, 2026
Zhiyu Huang, Zixu Zhang, Ameya Vaidya, Yuxiao Chen, Chen Lv, and Jaime Fernández Fisac. Versatile behavior diffusion for generalized traffic agent simulation, 2026.
2026
-
[10]
Directional-clamp PPO, 2025
Gilad Karpel, Ruida Zhou, Shoham Sabach, and Mohammad Ghavamzadeh. Directional-clamp PPO, 2025.
2025
-
[11]
Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps, 2025
Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, and Eugene Vinitsky. Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps, 2025.
2025
-
[12]
Set transformer: A framework for attention-based permutation-invariant neural networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning (ICML), pages 3744–3753. PMLR, 2019.
2019
-
[13]
The exiD dataset: A real-world trajectory dataset of highly interactive highway scenarios in Germany
Tobias Moers, Lennart Vater, Robert Krajewski, Julian Bock, Adrian Zlocki, and Lutz Eckstein. The exiD dataset: A real-world trajectory dataset of highly interactive highway scenarios in Germany. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 958–964, 2022.
2022
-
[14]
The waymo open sim agents challenge
Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nick Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, Brandyn White, and Dragomir Anguelov. The waymo open sim agents challenge.
-
[15]
Scene transformer: A unified architecture for predicting multiple agent trajectories
Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, David Weiss, Ben Sapp, Zhifeng Chen, and Jonathon Shlens. Scene transformer: A unified architecture for predicting multiple agent trajectories. In International Conference on Learning Representations (ICLR), 2022.
2022
-
[16]
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon ...
2019
-
[17]
Vehicle Dynamics and Control
Rajesh Rajamani. Vehicle Dynamics and Control. Springer Science & Business Media, New York, NY, second edition.
-
[18]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
2017
-
[19]
Mastering the game of Go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
2017
-
[20]
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
2018
-
[21]
Congested traffic states in empirical observations and microscopic simulations
Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2):1805–1824.
-
[22]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
2017
-
[23]
Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world, 2023
Eugene Vinitsky, Nathan Lichtlé, Xiaomeng Yang, Brandon Amos, and Jakob Foerster. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world, 2023.
2023
-
[24]
Behaviorgpt: Smart agent simulation for autonomous driving with next-patch prediction, 2024
Zikang Zhou, Haibo Hu, Xinhong Chen, Jianping Wang, Nan Guan, Kui Wu, Yung-Hui Li, Yu-Kai Huang, and Chun Jason Xue. Behaviorgpt: Smart agent simulation for autonomous driving with next-patch prediction, 2024.
2024