A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

Arahan Kujur

arxiv: 2605.16315 · v1 · pith:4WU75TDVnew · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 23:33 UTC · model grok-4.3

keywords self-play reinforcement learning · contingent decisions · asymmetric perturbations · exploitation attractor · decision capacity · poker variants · matrix games · function approximation

The pith

Self-play RL collapses to maximal-loss exploitation exactly when reach-weighted contingent decision capacity hits zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-play reinforcement learning agents can lose their ability to adapt when rules are changed asymmetrically. The work shows that this collapse to a deterministic exploitation strategy at near-maximal loss occurs as soon as every positive-reach contingent decision point is removed. Retaining even one such point blocks the convergence. The result holds across poker variants, matrix games, a dice game, and multiple algorithms; frozen-baseline and fixed-opponent controls indicate the driver is mutual co-adaptation under the constraint rather than the rule change itself. The threshold is timing-invariant and reversible, and it becomes sharper with function approximation.

Core claim

The paper claims that a sharp structural threshold exists at zero reach-weighted contingent action capacity. When this capacity is driven to zero by asymmetric perturbations, self-play agents converge rapidly to a deterministic exploitation attractor at near-maximal loss. Preserving a single positive-reach contingent decision point prevents the collapse. Fixed-opponent and frozen-baseline controls isolate co-adaptation under constraint as the mechanism. The effect is independent of perturbation timing, fully reversible by action restoration, and intensifies under function approximation.

What carries the argument

Reach-weighted contingent action capacity: the effective count of decision points reachable with positive probability that still permit contingent (non-deterministic) actions; this quantity determines whether agents remain able to co-adapt or fall into the exploitation fixed point.
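
The quantity described above can be sketched in a few lines of Python. This is an assumed formalization for illustration only: the page does not reproduce the paper's exact definition, so the `InfoSet` container, the function name `contingent_capacity`, the "at least two legal actions" test for contingency, and the toy reach values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoSet:
    """A decision point for the learning agent (hypothetical container)."""
    name: str
    reach: float          # probability that play arrives at this point
    legal_actions: tuple  # actions still available after any perturbation


def contingent_capacity(info_sets):
    """Sum of reach over decision points that still offer a real choice.

    A point is treated as contingent when it has positive reach and at
    least two legal actions. This is an assumption, not the paper's text.
    """
    return sum(
        (s.reach for s in info_sets
         if s.reach > 0.0 and len(s.legal_actions) >= 2),
        0.0,
    )


# Toy example (made-up numbers): two decision points, each reached half the time.
full = [
    InfoSet("first_to_act", 0.5, ("check", "bet")),
    InfoSet("facing_bet", 0.5, ("fold", "call")),
]
# One action removed at one point: a single contingent decision survives.
partial = [
    InfoSet("first_to_act", 0.5, ("check",)),
    InfoSet("facing_bet", 0.5, ("fold", "call")),
]
# One action removed at every point: no contingent decision survives.
removed = [
    InfoSet("first_to_act", 0.5, ("check",)),
    InfoSet("facing_bet", 0.5, ("fold",)),
]

print(contingent_capacity(full))     # 1.0
print(contingent_capacity(partial))  # 0.5
print(contingent_capacity(removed))  # 0.0
```

Under this reading, the paper's threshold corresponds to the third case: capacity reaches exactly zero only when every positive-reach point has been reduced to a forced move.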

If this is right

The collapse can be triggered at any stage of training; whether it occurs does not depend on when the perturbation is applied.
Restoring the removed actions fully reverses the convergence to the exploitation attractor.
Function approximation increases the severity of the collapse once the capacity threshold is crossed.
The same zero-capacity threshold appears consistently in poker variants, matrix games, dice games, and across several reinforcement learning algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Enforcing a minimum reach-weighted capacity during training could serve as a practical safeguard against sudden loss of adaptability in other constrained multi-agent settings.
The same structural dependence on contingent decision points may appear in single-agent settings that involve partial observability or changing environments.
Monitoring reach-weighted capacity could provide an early warning signal for instability in large-scale self-play systems before full collapse occurs.
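
The safeguard and early-warning ideas above could look roughly like the following check, run before an action mask is applied. It is a minimal sketch under stated assumptions: the function names, the dictionary layout, the `min_capacity` floor, and the example numbers are hypothetical, and "contingent" is again taken to mean "at least two remaining actions at a positive-reach point".

```python
def capacity_after_mask(reach, legal_actions, mask):
    """Reach-weighted capacity once `mask` has removed actions.

    reach:         dict mapping decision point -> reach probability
    legal_actions: dict mapping decision point -> set of available actions
    mask:          dict mapping decision point -> set of actions to remove
    """
    total = 0.0
    for point, prob in reach.items():
        remaining = legal_actions[point] - mask.get(point, set())
        if prob > 0.0 and len(remaining) >= 2:
            total += prob
    return total


def check_mask(reach, legal_actions, mask, min_capacity=1e-9):
    """Return (ok, capacity); ok is False when the mask would leave
    capacity at or below the assumed floor `min_capacity`."""
    capacity = capacity_after_mask(reach, legal_actions, mask)
    return capacity > min_capacity, capacity


# Toy example with made-up reach values.
reach = {"a": 0.7, "b": 0.3}
legal_actions = {"a": {"x", "y"}, "b": {"x", "y"}}

keep_one = {"a": {"y"}}                 # point "b" stays contingent
remove_all = {"a": {"y"}, "b": {"x"}}   # every point becomes a forced move

print(check_mask(reach, legal_actions, keep_one)[0])    # True
print(check_mask(reach, legal_actions, remove_all)[0])  # False
```

A training loop could log the returned capacity at each rule change and refuse, or at least flag, any mask for which `ok` is False.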

Load-bearing premise

Frozen baseline and fixed-opponent controls suffice to isolate co-adaptation under constraint as the cause rather than the perturbation acting directly.

What would settle it

Re-running the perturbations with a completely non-adaptive fixed opponent and checking whether the deterministic exploitation attractor still forms at the same speed and severity.

Figures

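Figures reproduced from arXiv: 2605.16315 by Arahan Kujur. In the captions, CAC abbreviates contingent action capacity and DEA the deterministic exploitation attractor.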

**Figure 2.** The CAC threshold (normalized). The discontinuity at 0

**Figure 3.** QL-Frozen avoids the DEA, isolating co-adaptation as the mechanism.

**Figure 4.** All four algorithms converge toward the DEA. DQN collapses deepest.

**Figure 5.** DQN neural analysis under zero contingency (Kuhn). Left: policy entropy drops to

**Figure 6.** Collapse and recovery are symmetric. The DEA is maintained only by the constraint.

**Figure 7.** Exploitability over time (Kuhn, full removal). Post-perturbation exploitability spikes as

Original abstract

We show that a threshold in decision capacity determines whether self-play reinforcement learning agents collapse under asymmetric rule perturbations. Across poker variants, matrix games, a dice game, and multiple learning algorithms, eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, a fixed point at near-maximal loss. Preserving even a single positive-reach contingent decision point prevents this collapse. A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself. The phenomenon is timing-invariant, fully reversible upon action restoration, and intensifies under function approximation. These results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously via reach-weighted capacity in the tested domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a sharp empirical threshold in reach-weighted contingent action capacity that triggers collapse in self-play RL, with consistent patterns across domains but limited quantitative detail so far.

The main point is that this work identifies a threshold at zero reach-weighted contingent action capacity. When all positive-reach contingent decisions are removed, self-play agents converge rapidly to a deterministic exploitation attractor at near-maximal loss. Keeping even one such decision point stops the collapse. The effect shows up across poker variants, matrix games, a dice game, and several algorithms, and it is described as timing-invariant and fully reversible once the removed actions are restored. Function approximation appears to make the collapse more severe.

Referee Report

1 major / 1 minor

Summary. The manuscript presents empirical evidence that a threshold in decision capacity, specifically the presence of positive-reach contingent decisions, determines whether self-play RL agents collapse under asymmetric rule perturbations. Eliminating all such decisions leads to rapid convergence to a deterministic exploitation attractor with near-maximal loss across poker variants, matrix games, a dice game, and multiple algorithms. Preserving even one prevents collapse. Controls using frozen baselines and fixed opponents are used to attribute this to co-adaptation under constraint. The phenomenon is timing-invariant, reversible, and intensifies with function approximation, establishing a sharp threshold at zero reach-weighted contingent action capacity.

Significance. If substantiated, this work identifies a fundamental structural threshold governing stability in self-play reinforcement learning, with broad implications for multi-agent systems and game theory applications. The consistency across diverse games and algorithms, along with the reversibility and timing-invariance, provides strong support for the generality of the finding. The identification of a capacity threshold at zero reach-weighted contingent actions offers a falsifiable prediction that could guide future theoretical and empirical work in RL.

major comments (1)

[Abstract, final paragraph] The assertion that 'A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself' is load-bearing for the central claim. However, in imperfect-information games the rule perturbation could alter reach probabilities and the effective set of contingent decision points even when the opponent or baseline is fixed, thereby changing the optimization landscape without requiring co-adaptation. This potential confound needs to be addressed with additional analysis or experiments showing that reach measures remain constant under the controls.

minor comments (1)

[Abstract] The abstract provides a high-level overview but lacks quantitative details such as specific loss values, convergence times, error bars, or statistical measures, which would strengthen assessment of the reported effects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment raises a valid point about potential confounds in the control experiments, which we address directly below.

Referee: [Abstract, final paragraph] The assertion that 'A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself' is load-bearing for the central claim. However, in imperfect-information games the rule perturbation could alter reach probabilities and the effective set of contingent decision points even when the opponent or baseline is fixed, thereby changing the optimization landscape without requiring co-adaptation. This potential confound needs to be addressed with additional analysis or experiments showing that reach measures remain constant under the controls.

Authors: We agree that explicitly ruling out this confound is necessary for the claim to hold. In the fixed-opponent and frozen-baseline controls the perturbation is applied only to the learning agent's own action set at information sets that already have positive reach under the unperturbed game tree; because the opponent policy is held fixed, the probability of reaching each of the agent's information sets is determined exclusively by the fixed opponent and the game structure, which are identical to the baseline. Consequently the reach-weighted contingent action capacity for the learning agent remains numerically unchanged. To make this transparent we will add a dedicated paragraph and supplementary table in the revised manuscript that reports the computed reach-weighted capacities (and the underlying reach probabilities) for the learning agent under each control condition, confirming constancy with the unperturbed setting. This addition directly addresses the referee's concern without altering the experimental design.

revision: yes
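
The constancy check requested by the referee and promised in the response amounts to comparing two reach tables. A minimal sketch of such a comparison follows; the function names, the tolerance, and the example tables are hypothetical and the numbers are made up, not taken from the paper.

```python
def reach_drift(baseline, control):
    """Largest absolute change in reach probability across information sets.

    Both arguments map an information-set label to its reach probability.
    A label missing from one table is treated as having reach 0.
    """
    labels = set(baseline) | set(control)
    return max(
        (abs(baseline.get(k, 0.0) - control.get(k, 0.0)) for k in labels),
        default=0.0,
    )


def reach_unchanged(baseline, control, tol=1e-9):
    """True when no information set's reach moved by more than `tol`."""
    return reach_drift(baseline, control) <= tol


# Made-up tables: uniform reach over three private cards.
unperturbed = {"J": 1 / 3, "Q": 1 / 3, "K": 1 / 3}
fixed_opponent = {"J": 1 / 3, "Q": 1 / 3, "K": 1 / 3}
shifted = {"J": 0.5, "Q": 0.5}  # "K" is no longer reached

print(reach_unchanged(unperturbed, fixed_opponent))  # True
print(reach_unchanged(unperturbed, shifted))         # False
```

If a control condition behaved like `shifted`, the perturbation would be changing reach directly, which is exactly the confound the referee describes.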

Circularity Check

0 steps flagged

No circularity: empirical observations with independent controls

The manuscript reports experimental results across poker variants, matrix games, dice games, and multiple RL algorithms. The claimed threshold at zero reach-weighted contingent action capacity is established by direct observation of collapse when all positive-reach contingent decisions are eliminated and preservation when at least one remains. Frozen-baseline and fixed-opponent controls are invoked to isolate co-adaptation as the mechanism; these are external experimental manipulations rather than fitted parameters or self-referential definitions. No equations, derivations, or self-citations are presented that reduce the reported threshold or mechanism to the inputs by construction. The work is therefore self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no free parameters, invented entities, or explicit axioms are stated. The work relies on the implicit domain assumption that the tested poker variants, matrix games, and dice game are representative of broader self-play RL behavior under perturbation.

axioms (1)

Domain assumption: The tested games and asymmetric rule perturbations are representative of general self-play RL collapse phenomena.
Basis: The abstract states that the results hold across poker variants, matrix games, a dice game, and multiple algorithms.
