pith. machine review for the scientific record.

arxiv: 2604.15315 · v1 · submitted 2026-02-17 · 💻 cs.GT · math.PR

Recognition: 1 theorem link · Lean Theorem

Can a Weaker Player Win? Adaptive Play in Repeated Games

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:11 UTC · model grok-4.3

classification 💻 cs.GT math.PR
keywords repeated games · adaptive strategies · weaker player · dynamic programming · finite horizon control · expected sign gain

The pith

A weaker player can achieve a positive expected match gain by switching adaptively between styles, even though each fixed style loses in expectation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a two-player game repeated over a finite horizon N in which Player 1 selects between two styles each round while Player 2 uses one fixed style. Player 1 is weaker in that each pure style yields a negative expected net win count, and the question is whether an optimal adaptive policy can nevertheless make the expected sign of the cumulative win-loss difference strictly positive. Dynamic programming solves the control problem exactly, with the running difference as the state. Numerical exploration reveals parameter regimes in which the optimal gain turns positive at some horizon, structural conditions that force the gain to remain negative for all N, and regimes in which the gain stays nonnegative and is strictly positive for every N ≥ 2. Asymptotic analysis then distinguishes safe from non-safe defensive cases and shows that the limiting gain can range continuously over [0, 1] or collapse to −1 or 0, depending on the probabilities.

Core claim

Even when each pure style for the weaker player yields a negative expected win-loss difference, there exist transition probabilities such that the optimal adaptive policy produces a strictly positive value of E[sign(X_N)] at some finite horizon N*, and in fair-defensive regimes the optimal gain is nonnegative for every horizon and strictly positive for all N ≥ 2.

What carries the argument

The dynamic-programming value function over the scalar state equal to the current win-loss difference, with actions at each step being the choice of offensive or defensive style.
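
A minimal sketch of that recursion (our code, not the authors'), assuming each style is summarized by a win/draw/loss probability triple as in the paper's figures. The terminal value is sign(X_N), and states beyond the still-reachable band are forced wins or losses, the green and red nodes of Figure 1:

    import numpy as np

    def optimal_gain(N, styles):
        """g*_N = V_0(0) by backward induction on the win-loss difference.

        styles: list of (p_win, p_draw, p_loss) triples, one per style.
        Index i of the value vector corresponds to state x = i - N.
        """
        x = np.arange(-N, N + 1)
        V = np.sign(x).astype(float)               # terminal condition: sign(X_N)
        for _ in range(N):                         # backward from n = N-1 to n = 0
            best = np.full(2 * N + 1, -np.inf)
            for pw, pd, pl in styles:
                win = np.append(V[1:], 1.0)        # V(x+1); above the grid a win is forced
                loss = np.insert(V[:-1], 0, -1.0)  # V(x-1); below the grid a loss is forced
                best = np.maximum(best, pw * win + pd * V + pl * loss)
            V = best
        return V[N]                                # value at x = 0

    # Win/draw/loss triples quoted with Figure 1 (variable names are ours):
    offensive = (0.49, 0.00, 0.51)  # strictly losing: 0.49 - 0.51 < 0
    defensive = (0.02, 0.95, 0.03)  # near-sure draw, still losing in expectation
    print(optimal_gain(32, [offensive, defensive]))  # paper reports g*_32 ≈ 0.453

The boundary clamps are exact, not approximations: once |x| exceeds the number of rounds remaining, the sign of X_N can no longer change, which is precisely the forced-win/forced-loss coloring described in Figure 1.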

If this is right

  • In safe fair-defensive regimes the optimal gain is nonnegative for all N and strictly positive for every N ≥ 2.
  • As N → ∞ the limiting gain varies continuously over [0, 1] when the defensive style is safe.
  • In non-safe cases where both styles are strictly losing, the limiting gain converges to −1.
  • When the defensive style is fair but non-safe, the limiting gain converges to 0 (the three limits are collected in the display after this list).
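
Collecting the three limits in one display (notation from the abstract; "safe" means the defensive style (D) induces a sure draw):

    \lim_{N \to \infty} g^\star_N =
    \begin{cases}
      \text{any value in } [0,1] & \text{if (D) is safe, varying continuously with the parameters,} \\
      0 & \text{if (D) is fair but non-safe,} \\
      -1 & \text{if both styles are strictly losing and non-safe.}
    \end{cases}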

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduction of the state to the scalar net difference shows that only the running score, not the order of past outcomes, determines future optimal choices.
  • The continuous range of limiting gains in the safe case implies that small changes in win probabilities can produce arbitrarily small positive long-run advantages.
  • These finite-horizon results suggest that repeated-interaction models with adaptive choice may need to track horizon length explicitly rather than invoke only infinite-horizon averages.

Load-bearing premise

Each of the two pure styles for Player 1 loses in expectation against Player 2's fixed style.
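
In the probability notation of the figures, this premise amounts to a negative per-round drift for each style:

    p_w - p_\ell < 0 \quad \text{and} \quad q_w - q_\ell < 0,

so the expected increment of X_n is negative no matter which style is held fixed.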

What would settle it

Run the dynamic program on the transition probabilities identified in the paper's positive-gain regimes; a non-positive optimal gain at every horizon N would falsify the existence of regimes with strictly positive gain. The scan below makes this concrete.
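
Using the optimal_gain sketch above (our code), the check is a short scan over horizons; under the Figure 1 parameters the first strictly positive value should appear at N* = 32 if the paper is right:

    # Falsification scan over horizons 1..40 (range chosen by us for illustration).
    gains = {N: optimal_gain(N, [offensive, defensive]) for N in range(1, 41)}
    first_positive = next((N for N, g in sorted(gains.items()) if g > 0), None)
    print(first_positive)  # None across every reported regime would falsify the claim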

Figures

Figures reproduced from arXiv:2604.15315 by Bruno Gaujal and Jonatha Anselmi.

Figure 2
Figure 2: Optimal gain g*_N under the strictly losing parameters (p_w, p_d, p_ℓ) = (0.43, 0, 0.57) and (q_w, q_d, q_ℓ) = (0.06, 0.84, 0.10).
Figure 1
Figure 1: Backward computation of g*_N = V_0^N(0) for N = 6. Terminal values are sign(X_6). Green (resp. red) nodes are states from which a match win (resp. loss) is already forced, hence their values are +1 (resp. −1). For example, with (p_w, p_d, p_ℓ) = (0.49, 0, 0.51) and (q_w, q_d, q_ℓ) = (0.02, 0.95, 0.03), the paper obtains N* = 32 and g*_32 ≈ 0.453.
read the original abstract

Consider a two-player game repeated N times. Player 1 can choose between two styles (for interpretability, offensive and defensive), whereas Player 2 uses a single fixed style. Let X_N := #wins − #losses for Player 1 after N games, and define the match gain as E[sign(X_N)], with sign(0) = 0. We assume Player 1 is weaker in the sense that each pure style is losing in expectation. Our objective is to identify under which parameter regimes Player 1 can nevertheless achieve a positive gain under an optimal adaptive policy. Using dynamic programming, we solve the finite-horizon control problem and numerically identify parameter regimes in which the optimal gain is strictly positive at some horizon N*. We also derive structural conditions guaranteeing that g*_N is always negative, and regimes (notably with fair (D)) where g*_N is nonnegative for all N and can be strictly positive for every N ≥ 2. We then characterize the asymptotic behavior as N → ∞ for a weak player. In the safe case, where the defensive style induces a sure draw, the limiting gain varies continuously with the parameters and may take any value in [0, 1]. In the non-safe case, the limiting gain converges to −1 when both styles are strictly losing, and to 0 when (D) is fair (and non-safe).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript considers a repeated two-player game of finite horizon N in which Player 1 adaptively selects between two pure styles (offensive and defensive) while Player 2 employs a fixed style. Player 1 is weaker in that each fixed style yields negative expected increment in the win-loss difference X_t. The match gain is defined as g_N = E[sign(X_N)] with sign(0)=0. The authors formulate the problem as a finite-horizon Markov decision process whose state is the current difference X_t, solve it via dynamic programming, numerically locate parameter regimes in which the optimal gain is strictly positive for some N*, derive structural conditions under which g*_N remains negative for all N, and characterize the N→∞ limits (continuous variation in [0,1] when the defensive style is a sure draw; convergence to -1 or 0 otherwise).

Significance. If the numerical regimes are robust, the work establishes that adaptivity can produce a strictly positive probability of a net win (in the sign metric) even when every stationary policy is losing in expectation, together with explicit structural conditions and asymptotic characterizations. This supplies a concrete, computable framework for studying adaptive play in asymmetric repeated interactions and may inform strategy design in settings where only the sign of the cumulative score matters.

major comments (1)
  1. [Numerical experiments] §4 (Numerical experiments): the identification of regimes in which g*_N > 0 for some finite N* is obtained by numerical DP search over parameter space; the manuscript provides neither the discretization grid, the number of Monte-Carlo replications used for verification, nor any sensitivity check that would rule out post-hoc selection of the reported regimes. Because these regimes constitute the central empirical claim, explicit documentation of the search procedure and robustness diagnostics is required.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'fair (D)' is used without a parenthetical reminder that D denotes the defensive style; a short clarification would aid readers who encounter the abstract first.
  2. [§2] Notation: the transition probabilities p_o, p_d (or equivalent) induced by each style are never written explicitly; inserting them once in §2 would make the Bellman recursion immediately verifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion regarding the numerical experiments. We will revise the manuscript to incorporate explicit documentation of the search procedure and robustness diagnostics as requested.

read point-by-point responses
  1. Referee: [Numerical experiments] §4 (Numerical experiments): the identification of regimes in which g*_N > 0 for some finite N* is obtained by numerical DP search over parameter space; the manuscript provides neither the discretization grid, the number of Monte-Carlo replications used for verification, nor any sensitivity check that would rule out post-hoc selection of the reported regimes. Because these regimes constitute the central empirical claim, explicit documentation of the search procedure and robustness diagnostics is required.

    Authors: We agree that additional documentation is needed for reproducibility and to substantiate the reported regimes. In the revised version we will add a dedicated paragraph (or subsection) in §4 that specifies: (i) the uniform discretization grid employed over the parameter space (step sizes for each probability and payoff parameter), (ii) the number of Monte-Carlo replications (10 000) used to cross-verify the dynamic-programming values, and (iii) a brief sensitivity study that repeats the search at coarser and finer grids as well as with 5 000 and 20 000 replications, confirming that the identified regimes where g*_N > 0 remain stable. These additions will be placed immediately before the presentation of the numerical results. revision: yes
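
A sketch of the kind of cross-check the rebuttal commits to; the function name, seeding, and structure are ours, not the manuscript's. It extracts the greedy policy during the backward pass, simulates matches under that policy, and compares the empirical mean of sign(X_N) with the DP value:

    import numpy as np

    def mc_cross_check(N, styles, reps=10_000, seed=0):
        """Monte-Carlo verification of g*_N against the DP value V_0(0)."""
        rng = np.random.default_rng(seed)
        x = np.arange(-N, N + 1)
        V = np.sign(x).astype(float)                   # V_N(x) = sign(x)
        policy = np.zeros((N, 2 * N + 1), dtype=int)   # best style per (step, state)
        for n in range(N - 1, -1, -1):                 # backward pass, storing argmax
            Q = np.stack([
                pw * np.append(V[1:], 1.0) + pd * V + pl * np.insert(V[:-1], 0, -1.0)
                for pw, pd, pl in styles
            ])
            policy[n] = Q.argmax(axis=0)
            V = Q.max(axis=0)
        dp_value = V[N]                                # V_0(0)
        total = 0.0
        for _ in range(reps):                          # forward simulation under the policy
            xt = 0
            for n in range(N):
                pw, pd, pl = styles[policy[n][xt + N]]
                xt += rng.choice([1, 0, -1], p=[pw, pd, pl])
            total += np.sign(xt)
        return dp_value, total / reps                  # should agree within MC error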

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper formulates a finite-horizon MDP on the integer state X_t (win-loss difference) with actions being the choice of style at each step. The optimal gain g*_N is obtained by direct application of the Bellman recursion, backward from the terminal condition sign(X_N). All structural conditions (always-negative gain, nonnegativity in the fair-D case, asymptotic limits) are derived as consequences of the same recursion under the externally stated drift assumptions; no parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a uniqueness theorem or ansatz, and the Markov property holds by the independence of rounds given the chosen styles. The derivation is therefore self-contained and does not reduce any claimed result to a quantity defined by the same inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The model rests on the domain assumption that each pure style is losing in expectation and on the Markov property that the future depends only on the current win-loss difference; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Each pure style for Player 1 is losing in expectation against Player 2's fixed style.
    Explicitly stated as the definition of the weaker player.
  • domain assumption The state can be summarized by the running difference X_n alone.
    Implicit in the use of dynamic programming on that state.

pith-pipeline@v0.9.0 · 5569 in / 1364 out tokens · 23707 ms · 2026-05-15T22:11:53.684301+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

  1. [1] P. Arena, S. Fazzino, L. Fortuna, and P. Maniscalco. Game theory and non-linear dynamics: the Parrondo paradox case study. Chaos, Solitons & Fractals, 17(2):545–555, 2003.

  2. [2] L. Dinis. Optimal sequence for Parrondo games. Physical Review E, 77(2):021124, 2008.

  3. [3] W. Feller. An Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley and Sons, 3rd edition, 1968.

  4. [4] G. Harmer and D. Abbott. Losing strategies can win by Parrondo's paradox. Nature, 402:864, 1999.

  5. [5] J. W. Lai and K. H. Cheong. Parrondo's paradox from classical to quantum: A review. Nonlinear Dynamics, 100(1):849–861, 2020.

  6. [6] J. M. R. Parrondo, G. P. Harmer, and D. Abbott. New paradoxical games based on Brownian ratchets. Physical Review Letters, 85(24):5226–5229, 2000.

  7. [7] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.