pith. machine review for the scientific record.

arxiv: 2605.05266 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.LG

Recognition: unknown

Differential Privacy in the Extensive-Form Bandit Problem

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:28 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords differential privacy · extensive-form games · bandit problem · regret bounds · local privacy · online learning · imperfect information

The pith

An algorithm satisfies local differential privacy in the extensive-form bandit problem while attaining regret that scales as the square root of the number of trials, divided by epsilon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an algorithm for the extensive-form bandit problem that protects individual actions with ε-local differential privacy. In this problem a learner repeatedly selects moves in a game tree against an adversary, receiving only partial observations at information sets and the resulting payoffs. The algorithm keeps total regret at order Õ(√(A ln(S) T)/ε), where A counts possible actions, S counts reduced strategies, and T counts rounds. A sympathetic reader would care because private sequential decision-making arises in coordinated user-server settings such as games or auctions, and the result shows that privacy need not destroy sublinear regret. The per-round running time is essentially the cost of sending one reduced strategy to the user.
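The shape of that regret bound can be made concrete with a few lines of arithmetic. A minimal sketch, with hypothetical values for A, S, T, and ε (the Õ hides constants and logarithmic factors, so these numbers illustrate scaling, not guarantees):

```python
import math

def regret_bound(A, S, T, eps):
    """Illustrative evaluation of sqrt(A * ln(S) * T) / eps.

    A: total number of actions, S: number of reduced strategies,
    T: number of trials, eps: local-DP parameter. All values passed
    below are hypothetical; the tilde-O constant is unknown.
    """
    return math.sqrt(A * math.log(S) * T) / eps

b1 = regret_bound(A=100, S=10**6, T=10_000, eps=1.0)
b2 = regret_bound(A=100, S=10**6, T=20_000, eps=1.0)   # double the horizon
b3 = regret_bound(A=100, S=10**6, T=10_000, eps=0.5)   # halve the privacy budget
# Doubling T multiplies the bound by sqrt(2); halving eps doubles it.
print(round(b2 / b1, 3), round(b3 / b1, 3))
```

Note that S enters only through ln(S), so the bound stays modest even when the number of reduced strategies is exponential in the size of the game tree.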

Core claim

We consider the extensive-form bandit problem, where on each trial the learner plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff or loss. We give an algorithm for this problem that satisfies ε-local differential privacy and attains a regret of Õ(√(A ln(S) T)/ε), where A is the total number of actions the learner can possibly take, S is the number of the learner's possible reduced strategies, and T is the number of trials. On each trial the time complexity is, up to a factor logarithmic in the maximum number of actions at an information set, equal to the time required for the server to transmit the reduced strategy to the user.

What carries the argument

An online algorithm that selects reduced strategies with added noise to enforce local differential privacy while controlling regret in imperfect-information sequential decision problems.
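The abstract does not spell out the noise mechanism, but the standard primitive for ε-local differential privacy on a bounded scalar observation is the Laplace mechanism. A minimal sketch under that assumption (the function name and the [0, 1] loss range are illustrative, not taken from the paper):

```python
import random

def privatize_loss(loss, eps, sensitivity=1.0):
    """Release a scalar loss under epsilon-local differential privacy.

    Adds Laplace noise with scale sensitivity/eps before anything leaves
    the user, so the server only ever sees the noisy value. Assumes the
    true loss lies in an interval of width `sensitivity` (here [0, 1]).
    """
    scale = sensitivity / eps
    # Laplace(0, scale) sampled as the difference of two independent
    # Exp(1) draws, scaled.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return loss + noise
```

Smaller ε means stronger privacy and larger noise; the 1/ε factor in the regret bound is the price the server pays for denoising these local reports.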

Load-bearing premise

The adversary is oblivious and does not adapt its moves in response to the learner's privately perturbed choices.

What would settle it

An experiment in which the adversary observes the noisy outputs of the algorithm and adapts to them, and the measured regret then exceeds the stated bound by more than a constant factor; this would show the oblivious-adversary premise is load-bearing rather than a convenience.
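Such a check can be sketched as a toy harness. The paper's algorithm is not reproduced in the abstract, so the sketch below substitutes plain EXP3 on a depth-one game (a K-armed bandit) with Laplace-privatized feedback; every name, parameter choice, and the learning-rate heuristic here is illustrative, not the paper's method:

```python
import math
import random

def exp3_private_regret(K=5, T=5000, eps=1.0, seed=0):
    """Toy harness: EXP3 with an exploration floor on a K-armed bandit
    whose loss feedback is Laplace-privatized, compared against the
    sqrt(K * ln(K) * T) / eps bound shape. Arms are oblivious (fixed
    Bernoulli means), matching the load-bearing premise above.
    """
    rng = random.Random(seed)
    eta = eps * math.sqrt(math.log(K) / (K * T))      # heuristic learning rate
    gamma = min(1.0, math.sqrt(K * math.log(K) / T))  # exploration floor
    weights = [1.0] * K
    means = [0.3 + 0.1 * k for k in range(K)]         # oblivious loss means
    total_loss = 0.0
    for _ in range(T):
        z = sum(weights)
        probs = [(1 - gamma) * w / z + gamma / K for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        loss = 1.0 if rng.random() < means[arm] else 0.0
        # Laplace(0, 1/eps) noise as a difference of two exponentials.
        noisy = loss + (rng.expovariate(eps) - rng.expovariate(eps))
        total_loss += loss
        est = noisy / probs[arm]                      # importance-weighted loss
        weights[arm] *= math.exp(-eta * est)
        top = max(weights)                            # renormalize for stability
        weights = [w / top for w in weights]
    regret = total_loss - min(means) * T              # vs. best fixed arm, in expectation
    bound = math.sqrt(K * math.log(K) * T) / eps
    return regret, bound
```

A falsifying outcome would be measured regret growing as a multiple of the bound as T increases; a single run staying under some constant times the bound is consistent with the claim.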

read the original abstract

We consider the extensive-form bandit problem, where on each trial the learner (a user coordinated by a server) plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff/loss. We give an algorithm for this problem that satisfies $\epsilon$-local differential privacy and attains a regret of $\tilde{O}(\sqrt{A\ln(S)T}/\epsilon)$, where $A$ is the total number of actions that the learner can possibly take, $S$ is the number of the learner's possible reduced strategies, and $T$ is the number of trials. On each trial, the time complexity of our algorithm is, up to a factor logarithmic in the maximum number of actions at an infoset, equal to the time required for the server to transmit the reduced strategy to the user. We note that local differential privacy is the strongest version of differential privacy and, to the best of our knowledge, this is the first work to study differential privacy of any form in the extensive-form bandit problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies the extensive-form bandit problem in which a learner coordinated by a server repeatedly plays an extensive-form game against an oblivious adversary, observing information sets and realized payoffs. It presents an algorithm that enforces ε-local differential privacy while attaining regret Õ(√(A ln(S) T)/ε), with A the total number of actions, S the number of reduced strategies, and T the horizon; per-round runtime is essentially the cost of transmitting a reduced strategy (up to a logarithmic factor in the maximum number of actions per information set). The work claims to be the first to consider differential privacy of any form in this setting.

Significance. If the algorithm and its analysis are correct, the result is significant: it supplies the first local-DP guarantee for extensive-form bandits, a setting that captures sequential imperfect-information decisions. The regret scales as 1/ε and remains sublinear in T while depending only logarithmically on the number of reduced strategies, which is a reasonable dependence for the privacy-utility trade-off. The near-optimal communication cost is a practical advantage for distributed implementations.

minor comments (3)
  1. The abstract states the regret bound and time complexity but provides no high-level proof idea or pointer to the section containing the full analysis; adding one sentence on the key technique (e.g., how noise is added to reduced-strategy sampling) would improve readability.
  2. The definition and enumeration of reduced strategies (S) should be illustrated with a small example early in the paper to make the dependence on ln(S) intuitive.
  3. Notation for information sets and the mapping from actions to reduced strategies is used without an explicit reference to a standard definition; a brief reminder or citation would help readers from the bandit literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the recognition of its significance as the first work to study differential privacy in the extensive-form bandit setting, and the recommendation for minor revision. We appreciate the comments on the regret scaling, communication cost, and practical advantages.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an algorithm for the extensive-form bandit problem that achieves ε-local differential privacy while attaining the stated regret bound against an oblivious adversary. The abstract and problem setup present the regret expression Õ(√(A ln(S) T)/ε) as a derived performance guarantee of the algorithm rather than a quantity fitted to data or defined in terms of itself. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The claim of being the first work on differential privacy in this setting further avoids circular justification chains. The derivation chain is self-contained relative to the stated assumptions and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the problem statement invokes an oblivious adversary and the standard extensive-form game structure but provides no explicit free parameters or invented entities.

axioms (1)
  • domain assumption The adversary is oblivious and does not adapt based on the learner's private actions.
    Explicitly stated in the problem definition in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1259 out tokens · 38152 ms · 2026-05-08T17:28:37.745063+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 10 canonical work pages

  1. [1] Hilal Asi, Vinod Raman, and Kunal Talwar. Faster rates for private adversarial bandits. ArXiv, abs/2505.21790, 2025.

  2. [2] Shaojie Bai, Lanting Zeng, Chengcheng Zhao, Xiaoming Duan, Mohammad Sadegh Talebi, Peng Cheng, and Jiming Chen. Differentially private no-regret exploration in adversarial Markov decision processes. In Conference on Uncertainty in Artificial Intelligence, 2024.

  3. [3] Yunru Bai, Chi Jin, Song Mei, and Tiancheng Yu. Near-optimal learning of extensive-form games with imperfect information. ArXiv, abs/2202.01752, 2022.

  4. [4] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. ArXiv, abs/1603.02010, 2016.

  5. [5] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7:17–51, 2006.

  6. [6] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Better regularization for sequential decision spaces: Fast convergence rates for Nash, correlated, and team equilibria. Proceedings of the 22nd ACM Conference on Economics and Computation, 2021.

  7. [7] Gabriele Farina, Robin Schmucker, and Tuomas Sandholm. Bandit linear optimization for sequential decision making and extensive-form games. ArXiv, abs/2103.04546, 2021.

  8. [8] Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, and Michal Valko. Adapting to game trees in zero-sum imperfect information games. In International Conference on Machine Learning, 2022.

  9. [9] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differentially private regret minimization in reinforcement learning. ArXiv, abs/2010.07778, 2020.

  10. [10] Samid Hoda, Andrew Gilpin, Javier F. Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Math. Oper. Res., 35:494–512, 2010.

  11. [11] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 531–540, 2008.

  12. [12] Tadashi Kozuno, Pierre Ménard, Rémi Munos, and Michal Valko. Model-free learning for two-player zero-sum partially observable Markov games with perfect recall. ArXiv, abs/2106.06279, 2021.

  13. [13] Arnab Maiti, Zhiyuan Fan, Kevin Jamieson, Lillian J. Ratliff, and Gabriele Farina. Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries. ArXiv, abs/2504.00461, 2025.

  14. [14] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Neural Information Processing Systems, 2015.

  15. [15] Stephen Pasteris, Chris Hicks, and Vasilios Mavroudis. Nearest neighbour with bandit feedback. ArXiv, abs/2306.13773, 2023.

  16. [16] Ryoma Sato and Shinji Ito. Fast exp3 algorithms. ArXiv, abs/2512.11201, 2025.

  17. [17] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. ArXiv, abs/1701.04222, 2017.