pith. machine review for the scientific record.

arxiv: 2604.04983 · v1 · submitted 2026-04-04 · 💻 cs.LG

Recognition: no theorem link

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 17:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-agent reinforcement learning · PPO · self-play · competitive overfitting · generalization · opponent mixing · Territory Paint Wars

The pith

Fixing PPO bugs in a new game reveals competitive overfitting that 20% random opponents cure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Territory Paint Wars, a simple symmetric zero-sum game in Unity where agents compete to paint territory. It identifies five implementation errors in PPO self-play training that cause agents to lose badly even to random opponents. Correcting those errors allows good early performance, but continued training produces competitive overfitting: agents maintain near-50% win rates against each other while their ability to beat unseen opponents collapses from 73.5% to 21.6%. Substituting a fixed random policy for the opponent in 20% of episodes during training prevents the collapse and restores 77.1% generalization win rate across ten random seeds.

Core claim

In Territory Paint Wars, initial PPO self-play training yields only 26.8% win rate against random opponents due to reward-scale imbalance, missing terminal signals, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection. After correcting these five issues, agents achieve 73.5% generalization win rate early but then enter competitive overfitting where self-play win rate stays near 50% while generalization drops to 21.6%. Opponent mixing that replaces the co-adaptive opponent with a uniform random policy in 20% of episodes eliminates the overfitting and produces 77.1% (±12.6%) generalization win rate without population-based training.
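
To make the corrections concrete, here is a minimal sketch, not the authors' code, of how two of the five fixes (the missing terminal signal and unnormalised observations) might look in a generic PPO reward and observation pipeline. The 10×10 grid and the +0.1 tile reward come from the paper; the lock-bonus and terminal-reward magnitudes, and the function names, are assumptions for illustration.

    import numpy as np

    GRID = 10  # 10x10 board, per Figure 1

    def normalise_obs(raw_obs):
        # Fix for "unnormalised observations": scale raw tile counts and
        # coordinates into [0, 1] before they reach the policy network.
        return np.asarray(raw_obs, dtype=np.float32) / GRID

    def step_reward(tiles_gained, locks_gained, done, won):
        # Dense shaping: +0.1 per tile gained is taken from the paper;
        # the 0.25 lock bonus is an assumed value for illustration.
        r = 0.1 * tiles_gained + 0.25 * locks_gained
        # Fix for "missing terminal signal": make the final win/loss visible
        # in the return; the +/-1.0 magnitude is assumed.
        if done:
            r += 1.0 if won else -1.0
        return r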

What carries the argument

Competitive overfitting, in which co-adapting agents sustain stable self-play performance while generalization win rate collapses; mitigated by opponent mixing, which substitutes a fixed random policy for the co-adaptive opponent in 20% of training episodes.
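
As a concrete reading of that mechanism, the following is a minimal sketch of opponent mixing in a self-play training loop, assuming hypothetical environment and agent interfaces (env.reset, env.step, learner.act, learner.store); only the 20% mixing rate is taken from the paper.

    import random

    MIX_RATE = 0.20  # fraction of episodes played against a random opponent (paper's value)

    def random_policy(n_actions):
        return random.randrange(n_actions)

    def run_training_episode(env, learner, opponent):
        # With probability MIX_RATE, replace the co-adaptive opponent with a
        # fixed uniformly-random policy for the whole episode.
        use_random = random.random() < MIX_RATE
        obs = env.reset()
        done = False
        while not done:
            a_learner = learner.act(obs["learner"])
            a_opponent = (random_policy(env.n_actions) if use_random
                          else opponent.act(obs["opponent"]))
            obs, reward, done, _ = env.step(a_learner, a_opponent)
            learner.store(obs["learner"], a_learner, reward["learner"], done)
        learner.maybe_update()  # run the PPO update once enough data is buffered

The single free parameter is MIX_RATE; the ledger below notes that the 20% value is stated without derivation.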

Load-bearing premise

That the five listed implementation failures and the competitive overfitting pathology are representative of broader PPO self-play training and that the 20% mixing rate will transfer to other environments and algorithms.

What would settle it

A replication study in a different competitive environment where 20% random-opponent mixing during PPO self-play fails to keep generalization win rate above 40%.
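
The decisive quantity in such a replication is the generalisation win rate of a frozen checkpoint against a uniformly-random opponent. A hedged sketch of that measurement, reusing the hypothetical interfaces from the sketch above (the paper's exact evaluation budget is not stated, so the episode count here is arbitrary):

    def generalisation_win_rate(env, frozen_policy, n_episodes=500):
        # Play a fixed (no-learning) checkpoint against a uniformly-random
        # opponent and report the fraction of episodes it wins.
        wins = 0
        for _ in range(n_episodes):
            obs = env.reset()
            done = False
            while not done:
                a_agent = frozen_policy.act(obs["learner"], deterministic=True)
                a_rand = env.sample_random_action()
                obs, _, done, info = env.step(a_agent, a_rand)
            wins += int(info["winner"] == "learner")
        return wins / n_episodes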

Figures

Figures reproduced from arXiv: 2604.04983 by Diyansha Singh.

Figure 1. The Territory Paint Wars environment on a 10×10 grid. Light pink/green tiles are owned by each agent; dark pink/dark green tiles are locked and cannot be reclaimed; grey tiles are neutral. Agents (spheres) paint the tile they occupy each step.

Figure 2. Win-rate progression across all versions. The red bar (v2 at ep 12,000) exposes …

Figure 3. Competitive overfitting in v2 self-play. Win rate vs. random rises then collapses, …

Figure 4. Learning curves: rolling-100 Pink self-play win rate for all 10 seeds (v3, opponent …

Figure 5. Training diagnostic curves (mean ± std over 10 seeds). Entropy decreases as the policy specialises; explained variance rises toward 1.0, confirming the critic learns accurate TD-λ return estimates.
original abstract

We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for $84{,}000$ episodes achieves only $26.8\%$ win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from $73.5\%$ to $21.6\%$. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near $50\%$ throughout the collapse. We propose a minimal intervention -- opponent mixing, where $20\%$ of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to $77.1\%$ ($\pm 12.6\%$, $10$ seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Territory Paint Wars, a Unity-based competitive multi-agent environment, and uses it to diagnose PPO self-play failures. It identifies five implementation issues (reward-scale imbalance, missing terminal signals, long-horizon credit assignment, unnormalised observations, incorrect win detection) whose correction raises initial performance; after fixes, it reports an emergent competitive-overfitting regime in which self-play win rate remains near 50% while generalization win rate falls from 73.5% to 21.6%. A lightweight mitigation—replacing the co-adaptive opponent with a fixed random policy in 20% of episodes—restores generalization to 77.1% (±12.6%, 10 seeds) without population-based training. The environment and code are open-sourced.

Significance. If the reported numbers and ablations hold, the work supplies a reproducible benchmark and a minimal, infrastructure-light intervention for a failure mode that standard self-play metrics cannot detect. The concrete win-rate tables, controlled ablations on implementation bugs, and open-sourcing constitute clear strengths for the MARL community.

major comments (2)
  1. [§4.2] §4.2 (competitive-overfitting results): the generalization collapse from 73.5% to 21.6% is central to the pathology claim, yet the test-opponent protocol (fixed checkpoint, random, or held-out co-adaptive agents) and exact evaluation budget are not stated; without this, it is impossible to distinguish overfitting from a change in effective opponent strength.
  2. [§5] §5 (opponent-mixing ablation): the 20% random-substitution rate restores performance to 77.1% (±12.6%, 10 seeds), but no sweep over mixing fractions (e.g., 5%, 10%, 30%) or comparison against population-based training is provided; this makes the claim that the intervention is both “minimal” and generally applicable load-bearing yet under-supported.
minor comments (2)
  1. [Abstract] Abstract and §3: the initial 26.8% win rate against random is reported without seed count or variance, unlike the later 77.1% (±12.6%, 10 seeds); consistency in reporting statistical detail would improve clarity.
  2. [Table 1] Table 1 (implementation ablations): the contribution of each of the five fixes is shown, but the table does not report the number of independent runs or whether the same random seeds were used across ablations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential contribution to the MARL community. We address each major comment below.

point-by-point responses
  1. Referee: [§4.2] §4.2 (competitive-overfitting results): the generalization collapse from 73.5% to 21.6% is central to the pathology claim, yet the test-opponent protocol (fixed checkpoint, random, or held-out co-adaptive agents) and exact evaluation budget are not stated; without this, it is impossible to distinguish overfitting from a change in effective opponent strength.

    Authors: We agree that the evaluation protocol requires explicit clarification to support the overfitting claim. Generalization win rates are measured against a fixed uniform-random policy as well as held-out co-adaptive agents drawn from intermediate checkpoints of the corrected training runs. We will revise §4.2 to fully specify this protocol, including the precise evaluation budget (number of episodes per test opponent) and checkpoint selection criteria, ensuring readers can distinguish true competitive overfitting from shifts in opponent strength. revision: yes

  2. Referee: [§5] §5 (opponent-mixing ablation): the 20% random-substitution rate restores performance to 77.1% (±12.6%, 10 seeds), but no sweep over mixing fractions (e.g., 5%, 10%, 30%) or comparison against population-based training is provided; this makes the claim that the intervention is both “minimal” and generally applicable load-bearing yet under-supported.

    Authors: We acknowledge that a broader sweep would strengthen the generality claim. The 20% rate was selected after preliminary trials as the smallest mixing fraction that reliably prevented co-adaptation while preserving training stability. We will expand §5 with an explicit rationale for this choice and a qualitative comparison to population-based training, highlighting that opponent mixing requires no population maintenance or extra infrastructure. We do not have results from additional mixing-fraction experiments at this time. revision: partial

standing simulated objections not resolved
  • We do not currently have experimental results for a full sweep over alternative mixing fractions or a direct quantitative comparison against population-based training.

Circularity Check

0 steps flagged

No circularity detected: empirical ablation study with no derivations or self-referential equations.

full rationale

The manuscript is a purely experimental report on a custom Unity environment. It identifies five implementation failures via controlled ablations, measures win-rate collapse after fixes, and tests a 20% random-opponent mixing intervention. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes appear in the load-bearing claims. All results are direct measurements (win rates, standard deviations over 10 seeds) that remain externally falsifiable and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL training assumptions once implementation bugs are removed and on the empirical effectiveness of the mixing intervention in this environment.

free parameters (1)
  • opponent mixing rate = 20%
    Chosen as the minimal intervention that restores generalization; the 20% value is stated without derivation from first principles.
axioms (1)
  • domain assumption: Standard PPO update rules and value estimation remain valid after the listed implementation corrections.
    The paper treats PPO as correctly implemented once the five bugs are fixed.
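
For reference, the "standard PPO update rules" invoked by this axiom are the clipped surrogate objective of Schulman et al. [4], with $\hat{A}_t$ an advantage estimate (here built from TD-$\lambda$/GAE-style returns, per Figure 5) and $\epsilon$ the clipping range:

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$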

pith-pipeline@v0.9.0 · 5561 in / 1253 out tokens · 50138 ms · 2026-05-13T17:19:56.729291+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Dota 2 with Large Scale Deep Reinforcement Learning

    C. Berner, G. Brockman, B. Chan, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.

  2. [2]

    Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

    J. Heinrich and D. Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.

  3. [3]

    Synthetic Returns for Long-Term Credit Assignment

    D. Raposo, S. Ritter, A. Wayne, et al. Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425.

  4. [4]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  5. [5]

    Appendix A: Environment Implementation Details (internal anchor)

    The Unity scene contains two AgentController components, GridManager, GameManager, and PythonBridge. PythonBridge runs a background thread that accepts one TCP connection and exchanges JSON messages synchronously with the Python training loop. Message protocol: Python→Unity (action): {"pink_action": int, "green_action": …
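
The protocol sketched in that appendix entry is plain JSON over a single TCP connection. A minimal illustration of the Python side, where the "pink_action"/"green_action" keys come from the excerpt above but the host, port, and newline framing are assumptions:

    import json
    import socket

    def send_actions(sock, pink_action, green_action):
        # One synchronous exchange per environment step: send both agents'
        # actions, then block until Unity replies with the next state.
        msg = json.dumps({"pink_action": int(pink_action),
                          "green_action": int(green_action)}) + "\n"
        sock.sendall(msg.encode("utf-8"))
        reply = sock.makefile().readline()
        return json.loads(reply)

    # usage (endpoint assumed, not taken from the paper):
    # sock = socket.create_connection(("localhost", 9000))
    # state = send_actions(sock, pink_action=2, green_action=0)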