On the Exploitability of FTRL Dynamics

Emmanouil-Vasileios Vlatakis-Gkaragkounis; Yiheng Su

arxiv: 2604.05129 · v1 · submitted 2026-04-06 · 💻 cs.GT · cs.LG

On the Exploitability of FTRL Dynamics

Yiheng Su , Emmanouil-Vasileios Vlatakis-Gkaragkounis This is my paper

Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

classification 💻 cs.GT cs.LG

keywords FTRLexploitabilityzero-sum gamesonline learningregularized leadergame dynamicsclairvoyant optimizersteep regularizers

0 comments

The pith

Exploitability is an inherent feature of the FTRL family rather than an artifact of specific regularizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in two-player zero-sum games a Follow-the-Regularized-Leader learner with constant step size remains exploitable by an opponent who knows its strategy in advance. For a fixed optimizer the total surplus extracted scales as Omega of the number of the learner's suboptimal actions divided by the step size. Against an alternating optimizer in random games a surplus of order eta T over a polynomial in the dimensions is guaranteed with high probability regardless of the equilibrium. The analysis separates non-steep regularizers, which permit finite-time removal of bad actions, from steep ones, which add only a vanishing delay term. These results imply that strategic manipulation can systematically extract value from standard FTRL trajectories.

Core claim

We show that exploitability is an inherent feature of the FTRL family. For fixed optimizer we establish a sweeping law of order Omega(N/eta), proving that exploitation scales to the number of the learner's suboptimal actions N and vanishes in their absence. For alternating optimizer a surplus of Omega(eta T / poly(n,m)) can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers the sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation.

What carries the argument

The sweeping law bounding total exploitability by Omega(N/eta) together with the steep versus non-steep regularizer dichotomy that controls whether elimination of suboptimal actions occurs in finite time.

If this is right

A clairvoyant opponent can eliminate all of the learner's suboptimal actions in finite time when a non-steep regularizer is used.
In alternating play against random games the opponent extracts a surplus linear in eta and T.
Vulnerability to exploitation can be ranked across regularizers by a susceptibility measure.
The same leverage continues to hold under bilateral payoff uncertainty for the learner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Constant-step FTRL trajectories cannot achieve vanishing exploitability against a fully informed strategic opponent.
Algorithm designers may need adaptive step sizes or explicit no-exploitability constraints to protect FTRL in adversarial game settings.
The geometric dichotomy between steep and non-steep regularizers likely generalizes to other no-regret dynamics used in multi-agent learning.

Load-bearing premise

The optimizer is clairvoyant and knows the learner's strategy in advance while the learner uses constant step size eta.

What would settle it

A concrete small-game simulation where total surplus for a non-steep regularizer fails to grow linearly with the number of suboptimal actions N or where surplus for a steep regularizer vanishes faster than the predicted correction term.

read the original abstract

In this paper we investigate the exploitability of a Follow-the-Regularized-Leader (FTRL) learner with constant step size $\eta$ in $n\times m$ two-player zero-sum games played over $T$ rounds against a clairvoyant optimizer. In contrast with prior analysis, we show that exploitability is an inherent feature of the FTRL family, rather than an artifact of specific instantiations. First, for fixed optimizer, we establish a sweeping law of order $\Omega(N/\eta)$, proving that exploitation scales to the number of the learner's suboptimal actions $N$ and vanishes in their absence. Second, for alternating optimizer, a surplus of $\Omega(\eta T/\mathrm{poly}(n,m))$ can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers once more the sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation. Finally, we discuss whether this leverage persists under bilateral payoff uncertainty and we propose susceptibility measure to quantify which regularizers are most vulnerable to strategic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FTRL exploitability holds against clairvoyant opponents with fixed step size, but the inherent claim looks setup-specific rather than fully general.

read the letter

The paper shows that exploitability is a general property of FTRL dynamics against a clairvoyant optimizer, but the results hinge on that information advantage and constant step sizes. It extends earlier work by giving a sweeping law that ties surplus to the count of suboptimal actions for any fixed optimizer. The bound of order Omega(N/eta) is straightforward and vanishes cleanly when no bad actions remain. For alternating optimizers in random games the paper gives a high-probability surplus lower bound of Omega(eta T over poly(n,m)) that holds regardless of the equilibrium. The geometric distinction between steep and non-steep regularizers is also developed, showing how non-steep ones permit quick elimination of suboptimal actions while steep ones add a small delay. The main concern is whether these findings are truly inherent to FTRL or specific to the clairvoyant setup. The stress-test point is valid here: if the optimizer uses advance knowledge of the learner's strategy to achieve the elimination or the surplus, then the exploitability could be weaker or absent when both players have symmetric information. Constant step size is another fixed assumption that might not generalize. The paper does discuss bilateral uncertainty, but the core claims appear tied to the asymmetric information model. This is useful for people studying online algorithms in games who want to see how regularizer choice affects vulnerability. The new bounds and the dichotomy give concrete things to build on. It is worth sending to peer review so the proofs can be checked and the scope of the inherent claim clarified.

Referee Report

2 major / 2 minor

Summary. The paper claims that exploitability is an inherent feature of the FTRL family (rather than an artifact of specific instantiations) when using constant step-size η in n×m two-player zero-sum games against a clairvoyant optimizer. It establishes an Ω(N/η) sweeping law for a fixed optimizer (scaling with the number of the learner's suboptimal actions N) and an Ω(ηT/poly(n,m)) surplus bound (with high probability) for an alternating optimizer in random games. The analysis highlights a sharp geometric dichotomy between non-steep regularizers (allowing finite-time elimination of suboptimal actions) and steep ones (introducing a vanishing correction), and proposes a susceptibility measure while discussing persistence under bilateral payoff uncertainty.

Significance. If the central bounds hold, the work provides a unified characterization of exploitability across the FTRL family, moving beyond case-by-case analysis of specific regularizers. The high-probability guarantees in random games and the explicit treatment of the steep/non-steep dichotomy are strengths; the proposed susceptibility measure could help quantify vulnerability if formalized. No machine-checked proofs or reproducible code are referenced.

major comments (2)

[§3] §3 (fixed optimizer analysis): the Ω(N/η) sweeping law and the claim that exploitation 'vanishes in their absence' are derived under the clairvoyant optimizer that knows the learner's current strategy in advance. This assumption appears load-bearing for the 'inherent feature' conclusion; the manuscript should explicitly test or bound the result when the optimizer lacks advance knowledge of the strategy, as removing clairvoyance could collapse the order or eliminate the surplus.
[§4] §4 (alternating optimizer in random games): the Ω(ηT/poly(n,m)) surplus bound with high probability is stated to hold 'regardless of the equilibrium structure.' However, the derivation relies on constant η and the specific random-game model; it is unclear whether the poly(n,m) term or the high-probability statement remains valid for general (non-random) games or variable step sizes, which directly affects whether exploitability is truly inherent to FTRL rather than setup-specific.

minor comments (2)

[Abstract] The abstract introduces N and the steep/non-steep dichotomy without prior definition or reference to the relevant section/equation; this should be clarified on first use for readability.
[Discussion] The susceptibility measure is proposed in the final discussion but lacks a formal definition, comparison to existing metrics (e.g., regret or exploitability bounds), or explicit formula; adding this would strengthen the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, clarifying the scope of our assumptions while defending the strength of the results within the stated model. Revisions will be made to improve explicitness where helpful.

read point-by-point responses

Referee: [§3] §3 (fixed optimizer analysis): the Ω(N/η) sweeping law and the claim that exploitation 'vanishes in their absence' are derived under the clairvoyant optimizer that knows the learner's current strategy in advance. This assumption appears load-bearing for the 'inherent feature' conclusion; the manuscript should explicitly test or bound the result when the optimizer lacks advance knowledge of the strategy, as removing clairvoyance could collapse the order or eliminate the surplus.

Authors: The fixed-optimizer analysis in §3 is conducted under the clairvoyant model, in which the optimizer knows the learner's current strategy in advance. This is the standard worst-case setting for quantifying exploitability, as it represents the strongest possible adversary. Within this model, the Ω(N/η) sweeping law shows that constant-step-size FTRL is exploitable in a manner that scales with the number of suboptimal actions and vanishes precisely when N=0. This establishes that exploitability is inherent to the FTRL family rather than an artifact of particular regularizers, even against a fully informed optimizer. We do not claim the identical bound holds for non-clairvoyant optimizers; without advance knowledge the surplus would likely be smaller or harder to guarantee. The clairvoyant assumption is therefore central to obtaining a clean, strong negative result, but it does not weaken the paper's conclusion inside the model considered. In revision we will add an explicit remark on the role of clairvoyance and note that non-clairvoyant extensions are left for future work. revision: partial
Referee: [§4] §4 (alternating optimizer in random games): the Ω(ηT/poly(n,m)) surplus bound with high probability is stated to hold 'regardless of the equilibrium structure.' However, the derivation relies on constant η and the specific random-game model; it is unclear whether the poly(n,m) term or the high-probability statement remains valid for general (non-random) games or variable step sizes, which directly affects whether exploitability is truly inherent to FTRL rather than setup-specific.

Authors: The alternating-optimizer bound in §4 is derived for constant step size η inside the random-game model and is shown to hold with high probability independently of the equilibrium structure of any particular realization. The random-game assumption supplies the concentration needed for the high-probability statement and the poly(n,m) factor; we do not assert that the same quantitative bound extends to arbitrary payoff matrices or to variable step sizes. By working in random games we obtain a generic result that applies to almost all games, thereby supporting the claim that exploitability is an inherent feature of constant-step-size FTRL rather than an equilibrium-specific pathology. We will revise the manuscript to state these modeling choices more explicitly and to discuss the limitations with respect to general games and adaptive step sizes. revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation of FTRL exploitability bounds

full rationale

The paper derives lower bounds on exploitability for the FTRL family via direct analysis of the dynamics under constant step-size η against a clairvoyant optimizer. The Ω(N/η) sweeping law for fixed optimizers follows from counting suboptimal actions and the geometric properties of regularizers; the Ω(ηT/poly(n,m)) surplus for alternating optimizers is obtained probabilistically in random games. Neither result reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations; the geometric dichotomy between steep and non-steep regularizers is invoked as an external property rather than smuggled via prior work by the same authors. The derivation is self-contained given the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information available from abstract alone to identify specific free parameters, axioms, or invented entities; no equations or detailed assumptions are provided.

pith-pipeline@v0.9.0 · 5516 in / 1062 out tokens · 98201 ms · 2026-05-10T18:58:11.470906+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Behavior based learning in identifying high frequency trading strategies

arXiv: 2402 . 09721 [cs.GT].url: https : / / arxiv . org / abs / 2402 . 09721 (cit. on pp. 1, 17). [MMSS22] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan.Strate- gizing against Learners in Bayesian Games. 2022. arXiv: 2205.08562 [cs.LG].url: https://arxiv.org/abs/2205.08562(cit. on p. 17). [RW98] R. Tyrrell Rockafellar and Roger ...

work page doi:10.1007/978-3-642-02431-3(cit 2022
[2]

Online convex programming and generalized infinitesimal gradient ascent

arXiv: 2601 . 03853 [cs.GT].url: https : / / arxiv . org / abs / 2601 . 03853 (cit. on p. 18). 15 [Zin03] Martin Zinkevich. “Online convex programming and generalized infinitesimal gradient ascent”. In:Proceedings of the 20th international conference on machine learning (icml-03). 2003, pp. 928–936 (cit. on p. 4). [ZJBP07] Martin Zinkevich, Michael Johans...

work page 2003

[1] [1]

Behavior based learning in identifying high frequency trading strategies

arXiv: 2402 . 09721 [cs.GT].url: https : / / arxiv . org / abs / 2402 . 09721 (cit. on pp. 1, 17). [MMSS22] Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan.Strate- gizing against Learners in Bayesian Games. 2022. arXiv: 2205.08562 [cs.LG].url: https://arxiv.org/abs/2205.08562(cit. on p. 17). [RW98] R. Tyrrell Rockafellar and Roger ...

work page doi:10.1007/978-3-642-02431-3(cit 2022

[2] [2]

Online convex programming and generalized infinitesimal gradient ascent

arXiv: 2601 . 03853 [cs.GT].url: https : / / arxiv . org / abs / 2601 . 03853 (cit. on p. 18). 15 [Zin03] Martin Zinkevich. “Online convex programming and generalized infinitesimal gradient ascent”. In:Proceedings of the 20th international conference on machine learning (icml-03). 2003, pp. 928–936 (cit. on p. 4). [ZJBP07] Martin Zinkevich, Michael Johans...

work page 2003