Differential Privacy in the Extensive-Form Bandit Problem
Pith reviewed 2026-05-08 17:28 UTC · model grok-4.3
The pith
An algorithm for the extensive-form bandit problem satisfies epsilon-local differential privacy while attaining regret that grows as the square root of the number of trials, multiplied by 1/epsilon.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We consider the extensive-form bandit problem, where on each trial the learner plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff or loss. We give an algorithm for this problem that satisfies epsilon-local differential privacy and attains a regret of Õ(√(A ln(S) T)/ε), where A is the total number of actions the learner can possibly take, S is the number of the learner's possible reduced strategies, and T is the number of trials. On each trial the time complexity is, up to a factor logarithmic in the maximum number of actions at an information set, equal to the time required for the server to transmit the reduced strategy to the user.
What carries the argument
An online algorithm that selects reduced strategies with added noise to enforce local differential privacy while controlling regret in imperfect-information sequential decision problems.
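To make the mechanism concrete, here is a minimal flat-arm sketch of the idea: an Exp3-style learner whose observed loss is perturbed with Laplace noise on the user's side before anything is reported, so the server only ever sees an ε-locally-private value. This is illustrative scaffolding, not the paper's DP-EFB algorithm, which operates over reduced strategies of a game tree rather than independent arms.

```python
import math
import random

def ldp_exp3_round(weights, losses, eps, eta=0.1):
    """One trial of an Exp3-style update in which the observed loss is
    privatized with Laplace noise before it leaves the user, so the
    server only sees an eps-locally-differentially-private report.
    Flat-arm sketch only; the paper's algorithm works over reduced
    strategies of an extensive-form game."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # sample an arm (stand-in for sampling a reduced strategy)
    arm = random.choices(range(len(weights)), weights=probs)[0]
    loss = losses[arm]  # assumed to lie in [0, 1]
    # local DP: Laplace(1/eps) noise, sampled as a difference of two
    # iid exponentials, is added on the user's side before reporting
    noisy_loss = loss + random.expovariate(eps) - random.expovariate(eps)
    # importance-weighted exponential update from the privatized report
    weights[arm] *= math.exp(-eta * noisy_loss / probs[arm])
    return arm, noisy_loss
```

The 1/ε noise scale is what propagates into the 1/ε factor of the regret bound: smaller ε means heavier perturbation and noisier loss estimates.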
Load-bearing premise
The adversary is oblivious and does not adapt its moves in response to the learner's privately perturbed choices.
What would settle it
An experiment in which the adversary observes the noisy outputs of the algorithm and the measured regret exceeds the stated bound by more than a constant factor.
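Such a falsification experiment could be harnessed as follows: an adversary that watches the learner's past (noisy) outputs and concentrates loss on the arm the learner appears to favor. Every name here is hypothetical scaffolding; the real test would run the paper's algorithm on an actual game tree and compare the measured regret against the stated bound.

```python
import random

def adaptive_adversary_regret(policy, T=1000, n_arms=2):
    """Sketch of the falsification harness: the adversary observes the
    learner's past outputs and puts loss 1 on whichever arm the learner
    has played most so far, then regret is measured against the best
    fixed arm in hindsight. Hypothetical flat-arm stand-in."""
    history, total_loss = [], 0.0
    cum = [0.0] * n_arms  # cumulative loss of each fixed arm
    for _ in range(T):
        # adversary reacts to the observed play history
        target = max(range(n_arms), key=history.count) if history else 0
        losses = [1.0 if a == target else 0.0 for a in range(n_arms)]
        arm = policy(history)
        total_loss += losses[arm]
        for a in range(n_arms):
            cum[a] += losses[a]
        history.append(arm)
    return total_loss - min(cum)  # regret vs. best fixed arm
```

If the measured regret of the private algorithm under such adaptivity grew faster than the stated Õ(√T/ε) rate by more than a constant factor, the oblivious-adversary assumption would be shown to be load-bearing.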
read the original abstract
We consider the extensive-form bandit problem, where on each trial the learner (a user coordinated by a server) plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff/loss. We give an algorithm for this problem that satisfies $\epsilon$-local differential privacy and attains a regret of $\tilde{O}(\sqrt{A\ln(S)T}/\epsilon)$, where $A$ is the total number of actions that the learner can possibly take, $S$ is the number of the learner's possible reduced strategies, and $T$ is the number of trials. On each trial, the time complexity of our algorithm is, up to a factor logarithmic in the maximum number of actions at an infoset, equal to the time required for the server to transmit the reduced strategy to the user. We note that local differential privacy is the strongest version of differential privacy and, to the best of our knowledge, this is the first work to study differential privacy of any form in the extensive-form bandit problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the extensive-form bandit problem in which a learner coordinated by a server repeatedly plays an extensive-form game against an oblivious adversary, observing information sets and realized payoffs. It presents an algorithm that enforces ε-local differential privacy while attaining regret Õ(√(A ln(S) T)/ε), with A the total number of actions, S the number of reduced strategies, and T the horizon; per-round runtime is essentially the cost of transmitting a reduced strategy (up to a logarithmic factor in the maximum number of actions per information set). The work claims to be the first to consider differential privacy of any form in this setting.
Significance. If the algorithm and its analysis are correct, the result is significant: it supplies the first local-DP guarantee for extensive-form bandits, a setting that captures sequential imperfect-information decisions. The regret scales as 1/ε and remains sublinear in T while depending only logarithmically on the number of reduced strategies, which is a reasonable dependence for the privacy-utility trade-off. The near-optimal communication cost is a practical advantage for distributed implementations.
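The shape of the privacy-utility trade-off described above is easy to verify numerically. The helper below reproduces the claimed bound up to a stand-in constant c, which absorbs the hidden polylog factor the abstract does not specify.

```python
import math

def regret_bound(A, S, T, eps, c=1.0):
    """Shape of the claimed bound O~(sqrt(A * ln(S) * T) / eps), up to
    the hidden constant / polylogarithmic factor c, which the abstract
    leaves unspecified."""
    return c * math.sqrt(A * math.log(S) * T) / eps

# Sublinear in T, linear in 1/eps: halving eps doubles the bound,
# while quadrupling T only doubles it.
b = regret_bound(A=100, S=10**6, T=10_000, eps=1.0)
```

Note also the mild dependence on S: because it enters only through ln(S), even an exponentially large set of reduced strategies contributes just a polynomial factor.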
minor comments (3)
- The abstract states the regret bound and time complexity but provides no high-level proof idea or pointer to the section containing the full analysis; adding one sentence on the key technique (e.g., how noise is added to reduced-strategy sampling) would improve readability.
- The definition and enumeration of reduced strategies (S) should be illustrated with a small example early in the paper to make the dependence on ln(S) intuitive.
- Notation for information sets and the mapping from actions to reduced strategies is used without an explicit reference to a standard definition; a brief reminder or citation would help readers from the bandit literature.
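In the spirit of the second comment above, a small example makes the reduced-strategy count S concrete. A reduced strategy fixes actions only at information sets reachable under its own earlier choices. The toy tree encoding below is hypothetical, chosen purely for intuition.

```python
def reduced_strategies(tree):
    """Enumerate reduced strategies of a toy perfect-recall game tree,
    encoded as nested dicts mapping each action to its subtree (None
    at a leaf). A reduced strategy fixes actions only at decision
    points reachable given its own earlier choices."""
    if tree is None:
        return [()]
    out = []
    for action, sub in tree.items():
        for rest in reduced_strategies(sub):
            out.append((action,) + rest)
    return out

# Root chooses L or R; each choice leads to one further decision.
toy = {"L": {"a": None, "b": None}, "R": {"c": None, "d": None}}
S = len(reduced_strategies(toy))  # 4 reduced strategies
# versus 2 * 2 * 2 = 8 unreduced pure strategies: the R-side action is
# irrelevant after playing L (and vice versa), so S, and hence ln(S),
# can be much smaller than the naive strategy count suggests.
```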
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the recognition of its significance as the first work to study differential privacy in the extensive-form bandit setting, and the recommendation for minor revision. We appreciate the comments on the regret scaling, communication cost, and practical advantages.
Circularity Check
No significant circularity detected
full rationale
The paper introduces an algorithm for the extensive-form bandit problem that achieves ε-local differential privacy while attaining the stated regret bound against an oblivious adversary. The abstract and problem setup present the regret expression Õ(√(A ln(S) T)/ε) as a derived performance guarantee of the algorithm rather than a quantity fitted to data or defined in terms of itself. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The claim of being the first work on differential privacy in this setting further avoids circular justification chains. The derivation chain is self-contained relative to the stated assumptions and does not reduce to tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the adversary is oblivious and does not adapt based on the learner's private actions.
Reference graph
Works this paper leans on
- [1] Hilal Asi, Vinod Raman, and Kunal Talwar. Faster rates for private adversarial bandits. ArXiv, abs/2505.21790, 2025.
- [2] Shaojie Bai, Lanting Zeng, Chengcheng Zhao, Xiaoming Duan, Mohammad Sadegh Talebi, Peng Cheng, and Jiming Chen. Differentially private no-regret exploration in adversarial Markov decision processes. In Conference on Uncertainty in Artificial Intelligence, 2024.
- [3] Yunru Bai, Chi Jin, Song Mei, and Tiancheng Yu. Near-optimal learning of extensive-form games with imperfect information. ArXiv, abs/2202.01752, 2022.
- [4] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. ArXiv, abs/1603.02010, 2016.
- [5] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7:17–51, 2006.
- [6] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Better regularization for sequential decision spaces: Fast convergence rates for Nash, correlated, and team equilibria. In Proceedings of the 22nd ACM Conference on Economics and Computation, 2021.
- [7] Gabriele Farina, Robin Schmucker, and Tuomas Sandholm. Bandit linear optimization for sequential decision making and extensive-form games. ArXiv, abs/2103.04546, 2021.
- [8] Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, and Michal Valko. Adapting to game trees in zero-sum imperfect information games. In International Conference on Machine Learning, 2022.
- [9] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differentially private regret minimization in reinforcement learning. ArXiv, abs/2010.07778, 2020.
- [10] Samid Hoda, Andrew Gilpin, Javier F. Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Math. Oper. Res., 35:494–512, 2010.
- [11] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In 49th Annual IEEE Symposium on Foundations of Computer Science, pages 531–540, 2008.
- [12] Tadashi Kozuno, Pierre Ménard, Rémi Munos, and Michal Valko. Model-free learning for two-player zero-sum partially observable Markov games with perfect recall. ArXiv, abs/2106.06279, 2021.
- [13] Arnab Maiti, Zhiyuan Fan, Kevin Jamieson, Lillian J. Ratliff, and Gabriele Farina. Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries. ArXiv, abs/2504.00461, 2025.
- [14] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Neural Information Processing Systems, 2015.
- [15] Stephen Pasteris, Chris Hicks, and Vasilios Mavroudis. Nearest neighbour with bandit feedback. ArXiv, abs/2306.13773, 2023.
- [16] Ryoma Sato and Shinji Ito. Fast exp3 algorithms. ArXiv, abs/2512.11201, 2025.
- [17] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. ArXiv, abs/1701.04222, 2017.