
arxiv: 2605.16312 · v1 · pith:BDK5HC7Onew · submitted 2026-05-04 · 💻 cs.LG · cs.AI

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Pith reviewed 2026-05-20 23:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adversarial action masking · self-play reinforcement learning · robustness · action removal · multi-agent RL · poker · Q-learning · PPO

The pith

An attacker that removes legal actions from a self-play RL agent's options inflicts more damage than random removals or perturbations, with no recovery from extended training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in self-play reinforcement learning, selectively removing legal actions from the victim's action set before it acts produces larger performance losses than random masking or learned perturbations. This effect appears across poker games of different sizes and non-poker domains, and it holds for Q-learning, PPO, NFSP, neural NFSP, and DQN victims. A sympathetic reader would care because the results point to action availability itself as a distinct surface of vulnerability that self-play training does not close and that further masked training does not heal.

Core claim

Learned adversarial action masking damages self-play RL victims substantially more than random masking and learned perturbation baselines. The attack works on Q-learning, PPO, NFSP, neural NFSP, and DQN; it transfers across agents; self-play amplifies it; and victims show no recovery even after extended masked training. The adversary targets high-value decision points, as captured by reach-weighted contingent action capacity (CAC_w) and its value-weighted refinement (CAC_v).
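
To make the distinction between random and targeted removal concrete, here is a minimal sketch at a single decision point. It is not the paper's attack: the action names, the Q-values, and `targeted_mask` (a greedy stand-in for a learned adversary) are all assumptions made for illustration.

```python
import random

def act_greedy(q_values, legal_actions):
    """Victim picks the highest-valued action among those still available."""
    return max(legal_actions, key=lambda a: q_values[a])

def random_mask(legal_actions, k, rng):
    """Baseline: remove k legal actions uniformly at random, always leaving one."""
    k = min(k, len(legal_actions) - 1)
    removed = set(rng.sample(sorted(legal_actions), k))
    return [a for a in legal_actions if a not in removed]

def targeted_mask(q_values, legal_actions, k):
    """Hypothetical stand-in for a learned adversary: drop the k actions
    the victim values most, always leaving one."""
    k = min(k, len(legal_actions) - 1)
    ranked = sorted(legal_actions, key=lambda a: q_values[a], reverse=True)
    return ranked[k:]

# Toy decision point with made-up values (assumption, not from the paper).
q = {"fold": -1.0, "call": 0.5, "raise": 2.0}
legal = list(q)
rng = random.Random(0)

unmasked_value = q[act_greedy(q, legal)]
targeted_value = q[act_greedy(q, targeted_mask(q, legal, k=1))]
random_values = [q[act_greedy(q, random_mask(legal, 1, rng))] for _ in range(1000)]
random_value = sum(random_values) / len(random_values)

print(unmasked_value, random_value, targeted_value)
```

In this toy setting the targeted mask always removes the victim's best option, while a random mask does so only a third of the time, so the average value under random masking sits between the unmasked and targeted cases.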

What carries the argument

Reach-weighted contingent action capacity (CAC_w) and value-weighted CAC_v, which measure how the adversary focuses removals on high-value decision points by weighting available actions with state reach and value.
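
This page does not reproduce the paper's formulas for CAC_w and CAC_v. The sketch below is one hypothetical reading of "weighting available actions with state reach and value", offered only to illustrate the idea: `cac_w` sums the number of available actions weighted by reach, and `cac_v` additionally weights each action by a nonnegative value score. The function names, the state labels, and all numbers are assumptions.

```python
def cac_w(reach, available):
    """Hypothetical reach-weighted capacity: sum over states of
    reach(s) times the number of actions available at s."""
    return sum(reach[s] * len(actions) for s, actions in available.items())

def cac_v(reach, available, value):
    """Hypothetical value-weighted refinement: each available action
    contributes its (nonnegative) value score instead of a count of 1."""
    return sum(
        reach[s] * sum(value[s][a] for a in actions)
        for s, actions in available.items()
    )

# Two information states with made-up reach probabilities and values.
reach = {"s0": 0.8, "s1": 0.2}
value = {
    "s0": {"call": 0.5, "raise": 2.0},
    "s1": {"call": 0.5, "raise": 2.0},
}
full = {"s0": ["call", "raise"], "s1": ["call", "raise"]}
mask_common = {"s0": ["call"], "s1": ["call", "raise"]}  # removes raise where reach is high
mask_rare = {"s0": ["call", "raise"], "s1": ["call"]}    # removes raise where reach is low

loss_w_common = cac_w(reach, full) - cac_w(reach, mask_common)
loss_w_rare = cac_w(reach, full) - cac_w(reach, mask_rare)
loss_v_common = cac_v(reach, full, value) - cac_v(reach, mask_common, value)
loss_v_rare = cac_v(reach, full, value) - cac_v(reach, mask_rare, value)

print(loss_w_common, loss_w_rare, loss_v_common, loss_v_rare)
```

Under these assumed definitions, removing the same action at the frequently reached state costs more capacity than removing it at the rarely reached one, which is the kind of targeting pattern the metrics are said to capture.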

If this is right

  • The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims.
  • The attack transfers from one trained agent to another.
  • Self-play training increases the damage caused by action removal.
  • Extended training while actions remain masked produces no performance recovery.
  • Action availability forms a robustness surface separate from observation or action perturbations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world RL systems may need safeguards at the action interface to block external restrictions on available moves.
  • The same pattern could appear in other multi-agent settings where one participant can limit another's choices.
  • Training policies on randomly varying action sets might build resilience to this form of interference.
  • Whether the vulnerability appears in continuous-action or high-dimensional control tasks remains open for direct test.
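
The third extension above, training on randomly varying action sets, can be sketched in a toy one-step setting. This is an illustration of the idea only, not something the paper tests: the function name, the reward table, and the hyperparameters are assumptions, and nothing here shows that such training resists a learned mask.

```python
import random

def train_with_random_action_sets(true_reward, episodes, drop_prob, rng,
                                  lr=0.1, eps=0.1):
    """Tabular value learning on a one-step task where each action is
    independently hidden with probability drop_prob on every episode."""
    q = {a: 0.0 for a in true_reward}
    actions = sorted(true_reward)
    for _ in range(episodes):
        available = [a for a in actions if rng.random() > drop_prob]
        if not available:
            # Always leave at least one action so the agent can act.
            available = [rng.choice(actions)]
        if rng.random() < eps:
            a = rng.choice(available)  # explore among available actions
        else:
            a = max(available, key=lambda x: q[x])  # exploit
        q[a] += lr * (true_reward[a] - q[a])
    return q

# Made-up deterministic rewards (assumption for illustration).
rewards = {"fold": -1.0, "call": 0.5, "raise": 2.0}
q = train_with_random_action_sets(rewards, episodes=2000, drop_prob=0.3,
                                  rng=random.Random(0))
print(q)
```

Because the preferred action is sometimes unavailable during training, the agent is forced to keep usable value estimates for its fallback actions rather than only for its first choice.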

Load-bearing premise

The attacker can remove chosen legal actions from the victim's set in real time without the victim noticing the restriction or adapting its policy to neutralize the loss.

What would settle it

Victims recovering their original performance after prolonged training under continued learned masking, or learned masking performing no better than random masking, would show the central claim does not hold.
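
The two falsification conditions above can be written as simple checks on average returns. The sketch below assumes damage is measured as the drop in return relative to the unattacked victim; the function names, the tolerance, and the example numbers are hypothetical and are not results from the paper.

```python
def damage(baseline_return, attacked_return):
    """Damage as the drop in average return relative to the unattacked victim."""
    return baseline_return - attacked_return

def claim_checks(baseline, random_masked, learned_masked, retrained_masked,
                 tol=0.05):
    """Evaluate the two conditions that would undercut the central claim."""
    d_random = damage(baseline, random_masked)
    d_learned = damage(baseline, learned_masked)
    d_retrained = damage(baseline, retrained_masked)
    return {
        # Learned masking must hurt more than random masking.
        "learned_beats_random": d_learned > d_random + tol,
        # Recovery means the retrained victim is back near its baseline.
        "recovered": abs(d_retrained) <= tol,
        # Adversarial/random damage ratio, the quantity plotted in Figure 1.
        "damage_ratio": d_learned / d_random if d_random > 0 else float("inf"),
    }

# Made-up returns (assumption): baseline, random mask, learned mask,
# and learned mask after extended masked training.
result = claim_checks(baseline=1.0, random_masked=0.8,
                      learned_masked=0.2, retrained_masked=0.25)
print(result)
```

With these illustrative numbers the claim survives both checks; it would fail if `recovered` were true or `learned_beats_random` were false.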

Figures

Figures reproduced from arXiv: 2605.16312 by Arahan Kujur.

Figure 1: Scaling trend. The adversarial/random damage ratio is plotted against victim information-set count.
Figure 2: Learning under a fixed mask. NFSP in Leduc-5 does not recover under continued masked training; …

Original abstract

We study adversarial action masking in self-play reinforcement learning: an attacker selectively removes legal actions from a victim's action set. Unlike observation or action perturbations, removal eliminates decision options before the agent acts. Across poker games scaling from 6 to 5,531 information states and two non-poker domains, learned masking causes substantially more damage than random masking and learned perturbation baselines. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC$_w$) and a value-weighted refinement CAC$_v$. These results identify action availability as a distinct robustness surface in self-play RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies adversarial action masking in self-play reinforcement learning, where an attacker selectively removes legal actions from the victim's action set in real time. It claims that learned masking causes substantially more damage than random masking and learned perturbation baselines across poker games scaling from 6 to 5,531 information states and two non-poker domains. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC_w) and a value-weighted refinement CAC_v. These results identify action availability as a distinct robustness surface in self-play RL.

Significance. If the results hold, the work identifies action availability as a new robustness surface in self-play RL distinct from observation or action perturbations. The broad empirical evaluation across multiple algorithms (Q-learning, PPO, NFSP, DQN), game scales up to 5,531 states, and domains is a clear strength and provides consistent evidence for the headline claims. Credit is given for the cross-algorithm persistence and transfer results, which go beyond single-environment demonstrations common in the field.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The central claim that the attack 'shows no recovery under extended masked training' is load-bearing for the conclusion that action removal is a distinct, non-mitigable robustness surface. The manuscript does not report the number of additional training steps relative to the original self-play horizon, nor whether the victim is permitted to fully re-optimize its Q-values or policy gradients over the reduced legal action set. Without these details, insufficient retraining duration remains a plausible alternative explanation for the observed lack of recovery.
  2. [§5] §5 (Mechanistic Analysis): CAC_w and CAC_v are introduced as mechanistic explanations for why the adversary targets high-value points. However, both quantities are defined in terms of quantities measured after the attack has been optimized (reach and value under the learned mask). This creates a circularity risk: the metrics explain the attack only because they are constructed from its outputs. An independently motivated, pre-attack capacity measure would be needed to support the mechanistic interpretation.
minor comments (2)
  1. [Table 1 and Figure 3] Table 1 and Figure 3: Ensure all damage metrics include standard errors or confidence intervals and the number of independent runs; current presentation makes it difficult to assess whether differences between learned masking and baselines are statistically reliable.
  2. [Notation] Notation: The precise mathematical definitions of CAC_w and CAC_v should appear in the main text (not only the appendix) with explicit summation indices over information states and actions.
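
The first minor comment asks for standard errors, confidence intervals, and run counts. A minimal sketch of that reporting follows, using a normal-approximation interval. The per-run damage values are invented for illustration, and with only a handful of runs a t-based interval would be wider than the one computed here.

```python
import math
import statistics

def summarize_runs(values, z=1.96):
    """Mean, standard error, and normal-approximation 95% interval
    over independent runs."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # sample std / sqrt(n)
    return {"n": n, "mean": mean, "se": se,
            "ci95": (mean - z * se, mean + z * se)}

# Hypothetical per-run damage values (assumption, not from the paper).
learned = summarize_runs([0.78, 0.82, 0.80, 0.79, 0.81])
rand = summarize_runs([0.22, 0.18, 0.20, 0.21, 0.19])

# Non-overlapping intervals are a quick, conservative sign of a real gap.
separated = learned["ci95"][0] > rand["ci95"][1]
print(learned, rand, separated)
```

Reporting `n` alongside each interval lets a reader judge whether the normal approximation is reasonable for the number of runs used.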

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the breadth of our empirical evaluation across algorithms, scales, and domains. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The central claim that the attack 'shows no recovery under extended masked training' is load-bearing for the conclusion that action removal is a distinct, non-mitigable robustness surface. The manuscript does not report the number of additional training steps relative to the original self-play horizon, nor whether the victim is permitted to fully re-optimize its Q-values or policy gradients over the reduced legal action set. Without these details, insufficient retraining duration remains a plausible alternative explanation for the observed lack of recovery.

    Authors: We agree that explicit details on retraining duration and re-optimization are necessary to support the claim. The experiments extended masked training for a duration comparable to or exceeding the original self-play horizon, during which victim agents were permitted to fully re-optimize their Q-values or policy gradients over the reduced legal action sets. No meaningful recovery occurred. We will revise §4 to report the exact relative number of additional steps and confirm the re-optimization protocol, and we will update the abstract to reference this clarification.
    revision: yes

  2. Referee: [§5] §5 (Mechanistic Analysis): CAC_w and CAC_v are introduced as mechanistic explanations for why the adversary targets high-value points. However, both quantities are defined in terms of quantities measured after the attack has been optimized (reach and value under the learned mask). This creates a circularity risk: the metrics explain the attack only because they are constructed from its outputs. An independently motivated, pre-attack capacity measure would be needed to support the mechanistic interpretation.

    Authors: We acknowledge the risk of circularity in relying on post-attack reach and value. While CAC_w and CAC_v are motivated by the information structure of decision points, we agree that an independently computed pre-attack measure would provide stronger support. We will introduce and report a pre-attack capacity metric based solely on the victim's original policy distribution and game-tree reach, demonstrating its correlation with the adversary's targeting choices. This will be added to §5 to eliminate the circularity concern.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements from controlled experiments

Full rationale

The paper reports experimental outcomes from training RL agents (Q-learning, PPO, NFSP, DQN) under learned vs. random action masking and perturbation baselines across multiple domains. The central claims concern relative damage, transfer, amplification by self-play, and lack of recovery under extended training; these are direct measurements from policy performance under the attack, not quantities derived by algebraic reduction from the attack definition itself. CAC_w and CAC_v are introduced as post-hoc descriptive metrics that quantify reach and value at decision points targeted by the observed attack behavior. Because the metrics are computed from the same trajectories used to evaluate the attack, they function as explanatory summaries rather than load-bearing premises that force the experimental results. No self-citation chain, fitted-parameter-as-prediction, or self-definitional loop is required to obtain the headline performance deltas. The derivation chain is therefore self-contained against external benchmarks (the victim policies and environment simulators).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on empirical comparisons in poker and non-poker domains plus two new capacity metrics introduced to explain targeting behavior.

axioms (1)
  • standard math · Reinforcement learning proceeds in Markov decision processes where agents select from a legal action set at each information state.
    Implicit background for any action-masking study in RL.
invented entities (2)
  • CAC_w (reach-weighted contingent action capacity) · no independent evidence
    purpose: Quantify high-value decision points that the adversary targets
    New metric defined to capture the mechanistic pattern of the attack.
  • CAC_v (value-weighted refinement of CAC) · no independent evidence
    purpose: Refine the capacity measure using action values
    Introduced alongside CAC_w for finer analysis of attack focus.

pith-pipeline@v0.9.0 · 5660 in / 1304 out tokens · 41395 ms · 2026-05-20T23:37:45.278875+00:00 · methodology


