On the Exploitability of FTRL Dynamics
Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3
The pith
Exploitability is an inherent feature of the FTRL family rather than an artifact of specific regularizers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that exploitability is an inherent feature of the FTRL family. For fixed optimizer we establish a sweeping law of order Omega(N/eta), proving that exploitation scales to the number of the learner's suboptimal actions N and vanishes in their absence. For alternating optimizer a surplus of Omega(eta T / poly(n,m)) can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers the sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation.
What carries the argument
The sweeping law bounding total exploitability by Omega(N/eta) together with the steep versus non-steep regularizer dichotomy that controls whether elimination of suboptimal actions occurs in finite time.
If this is right
- A clairvoyant opponent can eliminate all of the learner's suboptimal actions in finite time when a non-steep regularizer is used.
- In alternating play against random games the opponent extracts a surplus linear in eta and T.
- Vulnerability to exploitation can be ranked across regularizers by a susceptibility measure.
- The same leverage continues to hold under bilateral payoff uncertainty for the learner.
Where Pith is reading between the lines
- Constant-step FTRL trajectories cannot achieve vanishing exploitability against a fully informed strategic opponent.
- Algorithm designers may need adaptive step sizes or explicit no-exploitability constraints to protect FTRL in adversarial game settings.
- The geometric dichotomy between steep and non-steep regularizers likely generalizes to other no-regret dynamics used in multi-agent learning.
Load-bearing premise
The optimizer is clairvoyant and knows the learner's strategy in advance while the learner uses constant step size eta.
What would settle it
A concrete small-game simulation where total surplus for a non-steep regularizer fails to grow linearly with the number of suboptimal actions N or where surplus for a steep regularizer vanishes faster than the predicted correction term.
read the original abstract
In this paper we investigate the exploitability of a Follow-the-Regularized-Leader (FTRL) learner with constant step size $\eta$ in $n\times m$ two-player zero-sum games played over $T$ rounds against a clairvoyant optimizer. In contrast with prior analysis, we show that exploitability is an inherent feature of the FTRL family, rather than an artifact of specific instantiations. First, for fixed optimizer, we establish a sweeping law of order $\Omega(N/\eta)$, proving that exploitation scales to the number of the learner's suboptimal actions $N$ and vanishes in their absence. Second, for alternating optimizer, a surplus of $\Omega(\eta T/\mathrm{poly}(n,m))$ can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers once more the sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation. Finally, we discuss whether this leverage persists under bilateral payoff uncertainty and we propose susceptibility measure to quantify which regularizers are most vulnerable to strategic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that exploitability is an inherent feature of the FTRL family (rather than an artifact of specific instantiations) when using constant step-size η in n×m two-player zero-sum games against a clairvoyant optimizer. It establishes an Ω(N/η) sweeping law for a fixed optimizer (scaling with the number of the learner's suboptimal actions N) and an Ω(ηT/poly(n,m)) surplus bound (with high probability) for an alternating optimizer in random games. The analysis highlights a sharp geometric dichotomy between non-steep regularizers (allowing finite-time elimination of suboptimal actions) and steep ones (introducing a vanishing correction), and proposes a susceptibility measure while discussing persistence under bilateral payoff uncertainty.
Significance. If the central bounds hold, the work provides a unified characterization of exploitability across the FTRL family, moving beyond case-by-case analysis of specific regularizers. The high-probability guarantees in random games and the explicit treatment of the steep/non-steep dichotomy are strengths; the proposed susceptibility measure could help quantify vulnerability if formalized. No machine-checked proofs or reproducible code are referenced.
major comments (2)
- [§3] §3 (fixed optimizer analysis): the Ω(N/η) sweeping law and the claim that exploitation 'vanishes in their absence' are derived under the clairvoyant optimizer that knows the learner's current strategy in advance. This assumption appears load-bearing for the 'inherent feature' conclusion; the manuscript should explicitly test or bound the result when the optimizer lacks advance knowledge of the strategy, as removing clairvoyance could collapse the order or eliminate the surplus.
- [§4] §4 (alternating optimizer in random games): the Ω(ηT/poly(n,m)) surplus bound with high probability is stated to hold 'regardless of the equilibrium structure.' However, the derivation relies on constant η and the specific random-game model; it is unclear whether the poly(n,m) term or the high-probability statement remains valid for general (non-random) games or variable step sizes, which directly affects whether exploitability is truly inherent to FTRL rather than setup-specific.
minor comments (2)
- [Abstract] The abstract introduces N and the steep/non-steep dichotomy without prior definition or reference to the relevant section/equation; this should be clarified on first use for readability.
- [Discussion] The susceptibility measure is proposed in the final discussion but lacks a formal definition, comparison to existing metrics (e.g., regret or exploitability bounds), or explicit formula; adding this would strengthen the contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below, clarifying the scope of our assumptions while defending the strength of the results within the stated model. Revisions will be made to improve explicitness where helpful.
read point-by-point responses
-
Referee: [§3] §3 (fixed optimizer analysis): the Ω(N/η) sweeping law and the claim that exploitation 'vanishes in their absence' are derived under the clairvoyant optimizer that knows the learner's current strategy in advance. This assumption appears load-bearing for the 'inherent feature' conclusion; the manuscript should explicitly test or bound the result when the optimizer lacks advance knowledge of the strategy, as removing clairvoyance could collapse the order or eliminate the surplus.
Authors: The fixed-optimizer analysis in §3 is conducted under the clairvoyant model, in which the optimizer knows the learner's current strategy in advance. This is the standard worst-case setting for quantifying exploitability, as it represents the strongest possible adversary. Within this model, the Ω(N/η) sweeping law shows that constant-step-size FTRL is exploitable in a manner that scales with the number of suboptimal actions and vanishes precisely when N=0. This establishes that exploitability is inherent to the FTRL family rather than an artifact of particular regularizers, even against a fully informed optimizer. We do not claim the identical bound holds for non-clairvoyant optimizers; without advance knowledge the surplus would likely be smaller or harder to guarantee. The clairvoyant assumption is therefore central to obtaining a clean, strong negative result, but it does not weaken the paper's conclusion inside the model considered. In revision we will add an explicit remark on the role of clairvoyance and note that non-clairvoyant extensions are left for future work. revision: partial
-
Referee: [§4] §4 (alternating optimizer in random games): the Ω(ηT/poly(n,m)) surplus bound with high probability is stated to hold 'regardless of the equilibrium structure.' However, the derivation relies on constant η and the specific random-game model; it is unclear whether the poly(n,m) term or the high-probability statement remains valid for general (non-random) games or variable step sizes, which directly affects whether exploitability is truly inherent to FTRL rather than setup-specific.
Authors: The alternating-optimizer bound in §4 is derived for constant step size η inside the random-game model and is shown to hold with high probability independently of the equilibrium structure of any particular realization. The random-game assumption supplies the concentration needed for the high-probability statement and the poly(n,m) factor; we do not assert that the same quantitative bound extends to arbitrary payoff matrices or to variable step sizes. By working in random games we obtain a generic result that applies to almost all games, thereby supporting the claim that exploitability is an inherent feature of constant-step-size FTRL rather than an equilibrium-specific pathology. We will revise the manuscript to state these modeling choices more explicitly and to discuss the limitations with respect to general games and adaptive step sizes. revision: partial
Circularity Check
No circularity detected in derivation of FTRL exploitability bounds
full rationale
The paper derives lower bounds on exploitability for the FTRL family via direct analysis of the dynamics under constant step-size η against a clairvoyant optimizer. The Ω(N/η) sweeping law for fixed optimizers follows from counting suboptimal actions and the geometric properties of regularizers; the Ω(ηT/poly(n,m)) surplus for alternating optimizers is obtained probabilistically in random games. Neither result reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations; the geometric dichotomy between steep and non-steep regularizers is invoked as an external property rather than smuggled via prior work by the same authors. The derivation is self-contained given the stated assumptions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Behavior based learning in identifying high frequency trading strategies
arXiv: 2402 . 09721 [cs.GT].url: https : / / arxiv . org / abs / 2402 . 09721 (cit. on pp. 1, 17). [MMSS22] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan.Strate- gizing against Learners in Bayesian Games. 2022. arXiv: 2205.08562 [cs.LG].url: https://arxiv.org/abs/2205.08562(cit. on p. 17). [RW98] R. Tyrrell Rockafellar and Roger ...
-
[2]
Online convex programming and generalized infinitesimal gradient ascent
arXiv: 2601 . 03853 [cs.GT].url: https : / / arxiv . org / abs / 2601 . 03853 (cit. on p. 18). 15 [Zin03] Martin Zinkevich. “Online convex programming and generalized infinitesimal gradient ascent”. In:Proceedings of the 20th international conference on machine learning (icml-03). 2003, pp. 928–936 (cit. on p. 4). [ZJBP07] Martin Zinkevich, Michael Johans...
work page 2003
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.