Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
Pith reviewed 2026-05-13 17:19 UTC · model grok-4.3
The pith
Fixing PPO bugs in a new game reveals competitive overfitting that 20% random opponents cure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In Territory Paint Wars, initial PPO self-play training yields only a 26.8% win rate against random opponents due to reward-scale imbalance, missing terminal signals, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection. After these five issues are corrected, agents reach a 73.5% generalization win rate early in training but then enter competitive overfitting, in which the self-play win rate stays near 50% while generalization drops to 21.6%. Opponent mixing, which replaces the co-adaptive opponent with a uniform random policy in 20% of episodes, mitigates the overfitting and yields a 77.1% (±12.6%) generalization win rate without population-based training.
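The review does not reproduce the paper's training code; purely as an illustration, a minimal sketch, assuming a standard NumPy-based PPO pipeline, of where the five fix sites typically land (all function names here are hypothetical, not the authors' code):

    # Hypothetical illustration -- not the paper's code -- of the five fix sites
    # in a conventional PPO data pipeline.
    import numpy as np

    def normalize_obs(obs, mean, var, eps=1e-8):
        # Fix: unnormalised observations -> standardise with running statistics.
        return (obs - mean) / np.sqrt(var + eps)

    def gae(rewards, values, dones, gamma=0.99, lam=0.95):
        # Fix: missing terminal signal -- a win/loss reward is assumed appended
        # to `rewards` at episode end before this is called.
        # Fix: reward-scale imbalance -- dense paint rewards are assumed rescaled
        # so they no longer drown out that terminal signal.
        # Fix: long-horizon credit assignment -- GAE propagates the terminal
        # signal back through the episode; `values` has length T+1 (it includes
        # a bootstrap value for the post-rollout state).
        T = len(rewards)
        advantages = np.zeros(T)
        last_gae = 0.0
        for t in reversed(range(T)):
            next_value = 0.0 if dones[t] else values[t + 1]
            delta = rewards[t] + gamma * next_value - values[t]
            last_gae = delta + gamma * lam * (0.0 if dones[t] else last_gae)
            advantages[t] = last_gae
        returns = advantages + values[:T]
        return returns, advantages

    def detect_winner(pink_tiles, green_tiles):
        # Fix: incorrect win detection -- compare final painted-tile counts
        # explicitly rather than trusting a stale side channel.
        if pink_tiles == green_tiles:
            return "draw"
        return "pink" if pink_tiles > green_tiles else "green"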
What carries the argument
Competitive overfitting, in which co-adapting agents sustain stable self-play performance while the generalization win rate collapses; mitigated by opponent mixing, which substitutes a fixed random policy for the co-adaptive opponent in 20% of training episodes.
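The intervention is simple enough to sketch. A minimal, hypothetical version of the per-episode opponent draw, assuming a conventional Python training loop (env, run_episode, and ppo_update are stand-ins, not the paper's API):

    # Hypothetical sketch of opponent mixing -- not the paper's training code.
    import random

    MIX_RATE = 0.20  # fraction of episodes played against the fixed random policy

    def random_policy(observation, n_actions=5):
        # Fixed uniformly-random opponent; never updated during training.
        return random.randrange(n_actions)

    def pick_opponent(learner_policy, mix_rate=MIX_RATE):
        # With probability mix_rate, substitute the fixed random policy for
        # the co-adaptive opponent; otherwise self-play against the learner.
        return random_policy if random.random() < mix_rate else learner_policy

    # Schematic per-episode usage inside a training loop:
    #   opponent = pick_opponent(current_policy)
    #   trajectory = run_episode(env, current_policy, opponent)
    #   ppo_update(current_policy, trajectory)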
Load-bearing premise
That the five listed implementation failures and the competitive overfitting pathology are representative of broader PPO self-play training and that the 20% mixing rate will transfer to other environments and algorithms.
What would settle it
A replication study in a different competitive environment where 20% random-opponent mixing during PPO self-play fails to keep generalization win rate above 40%.
Original abstract
We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for $84{,}000$ episodes achieves only $26.8\%$ win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from $73.5\%$ to $21.6\%$. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near $50\%$ throughout the collapse. We propose a minimal intervention -- opponent mixing, where $20\%$ of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to $77.1\%$ ($\pm 12.6\%$, $10$ seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Territory Paint Wars, a Unity-based competitive multi-agent environment, and uses it to diagnose PPO self-play failures. It identifies five implementation issues (reward-scale imbalance, missing terminal signals, long-horizon credit assignment, unnormalised observations, incorrect win detection) whose correction raises initial performance; after fixes, it reports an emergent competitive-overfitting regime in which self-play win rate remains near 50% while generalization win rate falls from 73.5% to 21.6%. A lightweight mitigation—replacing the co-adaptive opponent with a fixed random policy in 20% of episodes—restores generalization to 77.1% (±12.6%, 10 seeds) without population-based training. The environment and code are open-sourced.
Significance. If the reported numbers and ablations hold, the work supplies a reproducible benchmark and a minimal, infrastructure-light intervention for a failure mode that standard self-play metrics cannot detect. The concrete win-rate tables, controlled ablations on implementation bugs, and open-sourcing constitute clear strengths for the MARL community.
major comments (2)
- §4.2 (competitive-overfitting results): the generalization collapse from 73.5% to 21.6% is central to the pathology claim, yet the test-opponent protocol (fixed checkpoint, random, or held-out co-adaptive agents) and exact evaluation budget are not stated; without this, it is impossible to distinguish overfitting from a change in effective opponent strength.
- §5 (opponent-mixing ablation): the 20% random-substitution rate restores performance to 77.1% (±12.6%, 10 seeds), but no sweep over mixing fractions (e.g., 5%, 10%, 30%) or comparison against population-based training is provided; this makes the claim that the intervention is both “minimal” and generally applicable load-bearing yet under-supported.
minor comments (2)
- Abstract and §3: the initial 26.8% win rate against random is reported without seed count or variance, unlike the later 77.1% (±12.6%, 10 seeds); consistency in reporting statistical detail would improve clarity.
- Table 1 (implementation ablations): the contribution of each of the five fixes is shown, but the table does not report the number of independent runs or whether the same random seeds were used across ablations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential contribution to the MARL community. We address each major comment below.
Point-by-point responses
- Referee: §4.2 (competitive-overfitting results): the generalization collapse from 73.5% to 21.6% is central to the pathology claim, yet the test-opponent protocol (fixed checkpoint, random, or held-out co-adaptive agents) and exact evaluation budget are not stated; without this, it is impossible to distinguish overfitting from a change in effective opponent strength.
  Authors: We agree that the evaluation protocol requires explicit clarification to support the overfitting claim. Generalization win rates are measured against a fixed uniform-random policy as well as held-out co-adaptive agents drawn from intermediate checkpoints of the corrected training runs. We will revise §4.2 to fully specify this protocol, including the precise evaluation budget (number of episodes per test opponent) and checkpoint selection criteria, ensuring readers can distinguish true competitive overfitting from shifts in opponent strength. revision: yes
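The protocol the authors describe would amount to something like the following evaluation loop; a hedged sketch, with run_episode and the opponent set supplied by the caller rather than taken from the paper:

    # Hypothetical sketch of the evaluation protocol described above.
    def generalization_win_rate(policy, opponents, run_episode,
                                episodes_per_opponent=100):
        # opponents: the fixed uniform-random policy plus held-out co-adaptive
        # checkpoints; run_episode returns "win", "loss", or "draw" for `policy`.
        wins = total = 0
        for opponent in opponents:
            for _ in range(episodes_per_opponent):
                wins += run_episode(policy, opponent) == "win"
                total += 1
        return wins / total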
- Referee: §5 (opponent-mixing ablation): the 20% random-substitution rate restores performance to 77.1% (±12.6%, 10 seeds), but no sweep over mixing fractions (e.g., 5%, 10%, 30%) or comparison against population-based training is provided; this makes the claim that the intervention is both “minimal” and generally applicable load-bearing yet under-supported.
  Authors: We acknowledge that a broader sweep would strengthen the generality claim. The 20% rate was selected after preliminary trials as the smallest mixing fraction that reliably prevented co-adaptation while preserving training stability. We will expand §5 with an explicit rationale for this choice and a qualitative comparison to population-based training, highlighting that opponent mixing requires no population maintenance or extra infrastructure. We do not currently have results for a full sweep over alternative mixing fractions or a direct quantitative comparison against population-based training. revision: partial
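The sweep the referee asks for is mechanically simple given a mixing hook; an illustrative sketch, with train_fn and eval_fn as assumed stand-ins for the paper's training and evaluation routines:

    # Hypothetical sketch of the requested mixing-fraction sweep.
    import statistics

    def sweep_mixing_fractions(train_fn, eval_fn,
                               rates=(0.05, 0.10, 0.20, 0.30), seeds=10):
        # train_fn(rate, seed) -> trained policy; eval_fn(policy) -> win rate.
        results = {}
        for rate in rates:
            scores = [eval_fn(train_fn(rate, seed)) for seed in range(seeds)]
            results[rate] = (statistics.mean(scores), statistics.pstdev(scores))
        return results  # rate -> (mean, std) of generalization win rate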
Circularity Check
No circularity detected: empirical ablation study with no derivations or self-referential equations.
full rationale
The manuscript is a purely experimental report on a custom Unity environment. It identifies five implementation failures via controlled ablations, measures win-rate collapse after fixes, and tests a 20% random-opponent mixing intervention. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes appear in the load-bearing claims. All results are direct measurements (win rates, standard deviations over 10 seeds) that remain externally falsifiable and do not reduce to their own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- opponent mixing rate = 20%
axioms (1)
- domain assumption: Standard PPO update rules and value estimation remain valid after the listed implementation corrections.
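For reference, the clipped surrogate objective of standard PPO (Schulman et al., 2017) that this assumption takes to remain valid:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$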