Differential Privacy in the Extensive-Form Bandit Problem
Pith reviewed 2026-05-08 17:28 UTC · model grok-4.3
The pith
An algorithm for the extensive-form bandit problem satisfies epsilon-local differential privacy while attaining regret that grows as the square root of the number of trials, multiplied by 1/epsilon.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We consider the extensive-form bandit problem, where on each trial the learner plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff or loss. We give an algorithm for this problem that satisfies epsilon-local differential privacy and attains a regret of Õ(√(A ln(S) T)/ε), where A is the total number of actions the learner can possibly take, S is the number of the learner's possible reduced strategies, and T is the number of trials. On each trial the time complexity is, up to a factor logarithmic in the maximum number of actions at an information set, equal to the time required for the server to transmit the reduced strategy to the user.
What carries the argument
An online algorithm that selects reduced strategies with added noise to enforce local differential privacy while controlling regret in imperfect-information sequential decision problems.
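To make the mechanism concrete, here is a minimal flat-arm sketch of the idea: an Exp3-style learner whose observed loss is perturbed with Laplace noise on the user's side before anything is reported, so the server only ever sees an ε-locally-private value. This is illustrative scaffolding, not the paper's DP-EFB algorithm, which operates over reduced strategies of a game tree rather than independent arms.

```python
import math
import random

def ldp_exp3_round(weights, losses, eps, eta=0.1):
    """One trial of an Exp3-style update in which the observed loss is
    privatized with Laplace noise before it leaves the user, so the
    server only sees an eps-locally-differentially-private report.
    Flat-arm sketch only; the paper's algorithm works over reduced
    strategies of an extensive-form game."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # sample an arm (stand-in for sampling a reduced strategy)
    arm = random.choices(range(len(weights)), weights=probs)[0]
    loss = losses[arm]  # assumed to lie in [0, 1]
    # local DP: Laplace(1/eps) noise, sampled as a difference of two
    # iid exponentials, is added on the user's side before reporting
    noisy_loss = loss + random.expovariate(eps) - random.expovariate(eps)
    # importance-weighted exponential update from the privatized report
    weights[arm] *= math.exp(-eta * noisy_loss / probs[arm])
    return arm, noisy_loss
```

The 1/ε noise scale is what propagates into the 1/ε factor of the regret bound: smaller ε means heavier perturbation and noisier loss estimates.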
Load-bearing premise
The adversary is oblivious and does not adapt its moves in response to the learner's privately perturbed choices.
What would settle it
An experiment in which the adversary observes the noisy outputs of the algorithm and the measured regret exceeds the stated bound by more than a constant factor.
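Such a falsification experiment could be harnessed as follows: an adversary that watches the learner's past (noisy) outputs and concentrates loss on the arm the learner appears to favor. Every name here is hypothetical scaffolding; the real test would run the paper's algorithm on an actual game tree and compare the measured regret against the stated bound.

```python
import random

def adaptive_adversary_regret(policy, T=1000, n_arms=2):
    """Sketch of the falsification harness: the adversary observes the
    learner's past outputs and puts loss 1 on whichever arm the learner
    has played most so far, then regret is measured against the best
    fixed arm in hindsight. Hypothetical flat-arm stand-in."""
    history, total_loss = [], 0.0
    cum = [0.0] * n_arms  # cumulative loss of each fixed arm
    for _ in range(T):
        # adversary reacts to the observed play history
        target = max(range(n_arms), key=history.count) if history else 0
        losses = [1.0 if a == target else 0.0 for a in range(n_arms)]
        arm = policy(history)
        total_loss += losses[arm]
        for a in range(n_arms):
            cum[a] += losses[a]
        history.append(arm)
    return total_loss - min(cum)  # regret vs. best fixed arm
```

If the measured regret of the private algorithm under such adaptivity grew faster than the stated Õ(√T/ε) rate by more than a constant factor, the oblivious-adversary assumption would be shown to be load-bearing.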
read the original abstract
We consider the extensive-form bandit problem, where on each trial the learner (a user coordinated by a server) plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff/loss. We give an algorithm for this problem that satisfies $\epsilon$-local differential privacy and attains a regret of $\tilde{O}(\sqrt{A\ln(S)T}/\epsilon)$, where $A$ is the total number of actions that the learner can possibly take, $S$ is the number of the learner's possible reduced strategies, and $T$ is the number of trials. On each trial, the time complexity of our algorithm is, up to a factor logarithmic in the maximum number of actions at an infoset, equal to the time required for the server to transmit the reduced strategy to the user. We note that local differential privacy is the strongest version of differential privacy and, to the best of our knowledge, this is the first work to study differential privacy of any form in the extensive-form bandit problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the extensive-form bandit problem in which a learner coordinated by a server repeatedly plays an extensive-form game against an oblivious adversary, observing information sets and realized payoffs. It presents an algorithm that enforces ε-local differential privacy while attaining regret Õ(√(A ln(S) T)/ε), with A the total number of actions, S the number of reduced strategies, and T the horizon; per-round runtime is essentially the cost of transmitting a reduced strategy (up to a logarithmic factor in the maximum number of actions per information set). The work claims to be the first to consider differential privacy of any form in this setting.
Significance. If the algorithm and its analysis are correct, the result is significant: it supplies the first local-DP guarantee for extensive-form bandits, a setting that captures sequential imperfect-information decisions. The regret scales as 1/ε and remains sublinear in T while depending only logarithmically on the number of reduced strategies, which is a reasonable dependence for the privacy-utility trade-off. The near-optimal communication cost is a practical advantage for distributed implementations.
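The shape of the privacy-utility trade-off described above is easy to verify numerically. The helper below reproduces the claimed bound up to a stand-in constant c, which absorbs the hidden polylog factor the abstract does not specify.

```python
import math

def regret_bound(A, S, T, eps, c=1.0):
    """Shape of the claimed bound O~(sqrt(A * ln(S) * T) / eps), up to
    the hidden constant / polylogarithmic factor c, which the abstract
    leaves unspecified."""
    return c * math.sqrt(A * math.log(S) * T) / eps

# Sublinear in T, linear in 1/eps: halving eps doubles the bound,
# while quadrupling T only doubles it.
b = regret_bound(A=100, S=10**6, T=10_000, eps=1.0)
```

Note also the mild dependence on S: because it enters only through ln(S), even an exponentially large set of reduced strategies contributes just a polynomial factor.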
minor comments (3)
- The abstract states the regret bound and time complexity but provides no high-level proof idea or pointer to the section containing the full analysis; adding one sentence on the key technique (e.g., how noise is added to reduced-strategy sampling) would improve readability.
- The definition and enumeration of reduced strategies (S) should be illustrated with a small example early in the paper to make the dependence on ln(S) intuitive.
- Notation for information sets and the mapping from actions to reduced strategies is used without an explicit reference to a standard definition; a brief reminder or citation would help readers from the bandit literature.
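In the spirit of the second comment above, a small example makes the reduced-strategy count S concrete. A reduced strategy fixes actions only at information sets reachable under its own earlier choices. The toy tree encoding below is hypothetical, chosen purely for intuition.

```python
def reduced_strategies(tree):
    """Enumerate reduced strategies of a toy perfect-recall game tree,
    encoded as nested dicts mapping each action to its subtree (None
    at a leaf). A reduced strategy fixes actions only at decision
    points reachable given its own earlier choices."""
    if tree is None:
        return [()]
    out = []
    for action, sub in tree.items():
        for rest in reduced_strategies(sub):
            out.append((action,) + rest)
    return out

# Root chooses L or R; each choice leads to one further decision.
toy = {"L": {"a": None, "b": None}, "R": {"c": None, "d": None}}
S = len(reduced_strategies(toy))  # 4 reduced strategies
# versus 2 * 2 * 2 = 8 unreduced pure strategies: the R-side action is
# irrelevant after playing L (and vice versa), so S, and hence ln(S),
# can be much smaller than the naive strategy count suggests.
```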
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the recognition of its significance as the first work to study differential privacy in the extensive-form bandit setting, and the recommendation for minor revision. We appreciate the comments on the regret scaling, communication cost, and practical advantages.
Circularity Check
No significant circularity detected
full rationale
The paper introduces an algorithm for the extensive-form bandit problem that achieves ε-local differential privacy while attaining the stated regret bound against an oblivious adversary. The abstract and problem setup present the regret expression Õ(√(A ln(S) T)/ε) as a derived performance guarantee of the algorithm rather than a quantity fitted to data or defined in terms of itself. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The claim of being the first work on differential privacy in this setting further avoids circular justification chains. The derivation chain is self-contained relative to the stated assumptions and does not reduce to tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the adversary is oblivious and does not adapt based on the learner's private actions.
Reference graph
Works this paper leans on
- [1] Hilal Asi, Vinod Raman, and Kunal Talwar. Faster rates for private adversarial bandits. ArXiv, abs/2505.21790, 2025.
- [2] Shaojie Bai, Lanting Zeng, Chengcheng Zhao, Xiaoming Duan, Mohammad Sadegh Talebi, Peng Cheng, and Jiming Chen. Differentially private no-regret exploration in adversarial Markov decision processes. In Conference on Uncertainty in Artificial Intelligence, 2024.
- [3] Yunru Bai, Chi Jin, Song Mei, and Tiancheng Yu. Near-optimal learning of extensive-form games with imperfect information. ArXiv, abs/2202.01752, 2022.
- [4] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. ArXiv, abs/1603.02010, 2016.
- [5] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7:17–51, 2006.
- [6] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Better regularization for sequential decision spaces: Fast convergence rates for Nash, correlated, and team equilibria. In Proceedings of the 22nd ACM Conference on Economics and Computation, 2021.
- [7] Gabriele Farina, Robin Schmucker, and Tuomas Sandholm. Bandit linear optimization for sequential decision making and extensive-form games. ArXiv, abs/2103.04546, 2021.
- [8] Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, and Michal Valko. Adapting to game trees in zero-sum imperfect information games. In International Conference on Machine Learning, 2022.
- [9] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differentially private regret minimization in reinforcement learning. ArXiv, abs/2010.07778, 2020.
- [10] Samid Hoda, Andrew Gilpin, Javier F. Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Math. Oper. Res., 35:494–512, 2010.
- [11] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In 49th Annual IEEE Symposium on Foundations of Computer Science, pages 531–540, 2008.
- [12] Tadashi Kozuno, Pierre Ménard, Rémi Munos, and Michal Valko. Model-free learning for two-player zero-sum partially observable Markov games with perfect recall. ArXiv, abs/2106.06279, 2021.
- [13] Arnab Maiti, Zhiyuan Fan, Kevin Jamieson, Lillian J. Ratliff, and Gabriele Farina. Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries. ArXiv, abs/2504.00461, 2025.
- [14] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Neural Information Processing Systems, 2015.
- [15] Stephen Pasteris, Chris Hicks, and Vasilios Mavroudis. Nearest neighbour with bandit feedback. ArXiv, abs/2306.13773, 2023.
- [16] Ryoma Sato and Shinji Ito. Fast exp3 algorithms. ArXiv, abs/2512.11201, 2025.
- [17] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. ArXiv, abs/1701.04222, 2017.