Superhuman AI for Stratego using self-play reinforcement learning and test-time search

Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J · 2025 · arXiv 2511.07312

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.

Superhuman AI for Generals.io Using Self-Play Reinforcement Learning

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Self-play RL with a vision transformer policy, powered by a 10,000x faster JAX simulator, produces an agent that ranks #1 on the Generals.io leaderboard and wins 199-70 against top humans.

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.

Human-like autonomy emerges from self-play and a pinch of human data

cs.LG · 2026-06-11 · unverdicted · novelty 5.0

Self-play RL regularized with 30 minutes of human data produces driving policies that coordinate with humans, training in 15 hours on one GPU with 2500x less data than imitation learning.

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.

citing papers explorer

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

cs.LG · 2026-06-22 · unverdicted · none · ref 27

Superhuman AI for Generals.io Using Self-Play Reinforcement Learning

cs.LG · 2026-06-22 · unverdicted · none · ref 13

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

cs.LG · 2026-05-19 · unverdicted · none · ref 6

Human-like autonomy emerges from self-play and a pinch of human data

cs.LG · 2026-06-11 · unverdicted · none · ref 3

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

cs.LG · 2026-05-14 · unverdicted · none · ref 40
