Superhuman AI for Stratego using self-play reinforcement learning and test-time search
5 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (5)
years: 2026 (5)
verdicts: UNVERDICTED (5)
citing papers explorer
- EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
  EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.
- Superhuman AI for Generals.io Using Self-Play Reinforcement Learning
  Self-play RL with a vision transformer policy, powered by a 10,000x faster JAX simulator, produces an agent that ranks #1 on the Generals.io leaderboard and wins 199-70 against top humans.
- GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
  GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
- Human-like autonomy emerges from self-play and a pinch of human data
  Self-play RL regularized with 30 minutes of human data produces driving policies that coordinate with humans, training in 15 hours on one GPU with 2500x less data than imitation learning.
- Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
  DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.