Non-Asymptotic Pure Exploration by Solving Games
Pith reviewed 2026-05-25 16:14 UTC · model grok-4.3
The pith
Pure exploration in exponential-family bandits obtains its first finite-sample guarantees by treating the lower-bound optimization as a game and converging to its saddle point with no-regret learners that need only a best-response oracle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The optimization problem that lower-bounds the sample complexity of pure exploration in exponential-family bandits admits a game-theoretic formulation whose value is attained at a saddle point. Iterative no-regret learning strategies that only require a best-response oracle converge to this saddle point and thereby deliver the first finite-confidence guarantees that are adapted to the exponential family and that apply to arbitrary pure-exploration queries and bandit structures.
What carries the argument
The game whose payoff is the lower-bound optimization problem; no-regret learners play against each other using only best-response oracles to reach its saddle point.
If this is right
- Any pure-exploration query receives high-probability sample-complexity bounds that match the lower bound up to lower-order terms.
- The same sampling rule works for every exponential-family bandit model and every query without requiring a full optimizer.
- Only a best-response oracle is needed, so the method extends immediately to structures where the full optimization problem is intractable.
- Finite-sample guarantees hold uniformly over all queries rather than only in the asymptotic regime.
Where Pith is reading between the lines
- The game view could be applied to other sequential problems whose lower bounds are convex programs, such as best-policy identification in MDPs.
- Faster convergence rates might be obtained by swapping the generic no-regret learners for methods that exploit the specific geometry of the payoff matrix.
- If the best-response oracle itself must be approximated, the resulting error can be absorbed into the same finite-sample analysis.
Load-bearing premise
The lower-bound optimization problem must possess a saddle point that iterative best-response play can reach.
What would settle it
A simple two-arm Gaussian bandit with a fixed-confidence best-arm identification query in which the algorithm's stopping time exceeds twice the information-theoretic lower bound on more than 5 percent of trials.
Figures
read the original abstract
Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to recast the lower-bound optimization problem for pure exploration in exponential-family multi-armed bandits as an unknown zero-sum game possessing a saddle point. By applying no-regret learners that only require a best-response oracle (rather than solving the full optimization at each step), the authors obtain the first finite-confidence, non-asymptotic guarantees that are adapted to the exponential family and apply to arbitrary pure-exploration queries and bandit structures.
Significance. If the central claims hold, the contribution is significant: it advances pure-exploration theory beyond asymptotic optimality to explicit finite-time bounds while remaining computationally lighter through oracle access. The game-theoretic reduction leverages standard no-regret convergence properties and appears to preserve the exponential-family structure in the analysis, offering a general template that could apply across many query types.
major comments (2)
- [game formulation and minimax argument] The central claim that no-regret dynamics converge to the saddle point of the lower-bound game (thereby yielding finite-confidence sample-complexity bounds) is load-bearing; the manuscript must explicitly verify that the payoff function satisfies the conditions for existence of a saddle point under the exponential-family parameterization and that the best-response oracle is well-defined for the considered structures.
- [finite-confidence analysis] The finite-time analysis must show how the regret bounds of the no-regret learners translate into an explicit (non-asymptotic) upper bound on the number of samples needed to answer the query with a given; any hidden dependence on unknown parameters would undermine the non-asymptotic guarantee.
minor comments (2)
- [abstract and introduction] The abstract asserts applicability to 'any pure exploration query and bandit structure'; the introduction or related-work section should delineate the precise class of queries and structures for which the best-response oracle is assumed to exist.
- [notation] Notation for the game payoff and the best-response oracle should be introduced with a short table or explicit definitions early in the technical sections to improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The comments identify areas where explicitness can be improved. We address each major comment below.
read point-by-point responses
-
Referee: [game formulation and minimax argument] The central claim that no-regret dynamics converge to the saddle point of the lower-bound game (thereby yielding finite-confidence sample-complexity bounds) is load-bearing; the manuscript must explicitly verify that the payoff function satisfies the conditions for existence of a saddle point under the exponential-family parameterization and that the best-response oracle is well-defined for the considered structures.
Authors: We agree that explicit verification strengthens the central argument. The payoff function is the expected log-likelihood ratio, which is bilinear (hence convex-concave) over the product of simplices for exponential-family models; Sion's minimax theorem therefore guarantees a saddle point. The best-response oracle reduces to a one-dimensional convex optimization per arm (solving for the natural parameter matching a prescribed mean), which is well-defined and efficiently computable for any exponential family. We will add a dedicated paragraph in Section 2.2 stating these facts and citing the relevant minimax theorem. revision: yes
-
Referee: [finite-confidence analysis] The finite-time analysis must show how the regret bounds of the no-regret learners translate into an explicit (non-asymptotic) upper bound on the number of samples needed to answer the query with a given; any hidden dependence on unknown parameters would undermine the non-asymptotic guarantee.
Authors: Theorem 4 already derives the translation: the duality gap after T steps is bounded by the sum of the two players' regrets divided by T; the algorithm stops when the estimated value exceeds the threshold by this gap plus a concentration term. Inverting the resulting inequality produces an explicit non-asymptotic bound T = O((V* + R(T))/delta), where V* is the game value (the information-theoretic lower bound) and R(T) is a known function such as O(sqrt(T log K)) for Hedge. No unknown parameters appear outside V* itself, which is intrinsic to the problem. We will add a corollary that isolates this explicit sample-complexity expression. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper recasts the standard lower-bound optimization problem for pure exploration in exponential families as a zero-sum game with a saddle point, then applies established no-regret learner convergence properties together with a best-response oracle. All load-bearing steps (game formulation, saddle-point existence, oracle-based iteration) are developed within the manuscript using known minimax and regret bounds rather than reducing to self-definition, fitted inputs renamed as predictions, or self-citation chains. The finite-confidence guarantees follow directly from these constructions without circular reduction to the inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners...
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The Pure Exploration Game... value of the game is Dμ := sup_w inf_λ ∑ w_k d(μ_k, λ_k)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002
work page 2002
-
[2]
Fast rates for bandit optimization with upper-confidence Frank-Wolfe
Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems (NeurIPS), pages 2222– 2231, 2017
work page 2017
- [3]
-
[4]
Prediction, learning, and games
Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006
work page 2006
-
[5]
Towards instance optimal bounds for best arm identi- fication
Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identi- fication. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 535–592, Amsterdam, Netherlands, July 2017. PMLR
work page 2017
-
[6]
S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial Pure Exploration of Multi-Armed Bandits. In Advances in Neural Information Processing Systems, 2014
work page 2014
-
[7]
Unimodal bandits: Regret lower bounds and optimal algorithms
Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521–529, 2014
work page 2014
-
[8]
Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, April 2014
work page 2014
-
[9]
Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. ArXiv, February 2019
work page 2019
-
[10]
PAC bounds for multi-armed bandit and markov decision processes
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and markov decision processes. In 15th Annual Conference on Learning Theory (COLT), volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002
work page 2002
-
[11]
Adaptive game playing using multiplicative weights
Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999
work page 1999
-
[12]
Informational confidence bounds for self-normalized averages and applica- tions
Aurélien Garivier. Informational confidence bounds for self-normalized averages and applica- tions. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013
work page 2013
-
[13]
Optimal best arm identification with fixed confidence
Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016. 9
work page 2016
-
[14]
Thresholding Bandit for Dose-ranging: The Impact of Monotonicity
Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[15]
S. Kalyanakrishnan and P. Stone. Efficient Selection in Multiple Bandit Arms: Theory and Practice. In International Conference on Machine Learning (ICML), 2010
work page 2010
-
[16]
S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012
work page 2012
-
[17]
Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochas- tic rank-1 bandits. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning R...
work page 2017
-
[18]
E. Kaufmann, O. Cappé, and A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1–42, 2016
work page 2016
-
[19]
Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, October 2018
work page 2018
-
[20]
Emilie Kaufmann, Wouter M. Koolen, and Aurélien Garivier. Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 6333–6343, December 2018
work page 2018
-
[21]
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019
work page 2019
-
[22]
An optimal algorithm for the thresholding bandit problem
Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, ...
work page 2016
-
[23]
Lipschitz bandits: Regret lower bound and optimal algorithms
Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999, 2014
work page 2014
-
[24]
Gradient Ascent for Active Exploration in Bandit Problems
Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[25]
Optimization, learning, and games with predictable sequences
Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NeurIPS), pages 3066–3074, 2013
work page 2013
-
[26]
The simulator: Understanding adaptive sampling in the moderate-confidence regime
Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794– 1834, 2017
work page 2017
-
[27]
Efficient sampling method for Monte Carlo tree search problem
Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014
work page 2014
- [28]
-
[29]
Identify the Nash equilibrium in static games with random payoffs
Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4160–4169, International Convention Centre, Sydney, Australia, August 2017. PML...
work page 2017
-
[30]
U k t (1) = max { f(t−1) Nk t−1 , maxξ∈[αk t ,βk t ] Eλ∼qt d(ξ, λk) }
-
[31]
U k t (2) = max { f(t−1) Nk t−1 , maxξ∈[ak t ,bk t ] Eλ∼qt d(ξ, λk) }
-
[32]
U k t (1) (λ) = max { f(t−1) Nk t−1 , maxξ∈[αk t ,βk t ] d(ξ, λk) }
-
[33]
U k t (2) (λ) = max { f(t−1) Nk t−1 , maxξ∈[ak t ,bk t ] d(ξ, λk) } . The UCBs indexed by (2) are larger but potentially easier to compute that the ones indexed by (1), since ak t and bk t are easier to compute than αk t and βk t . The next lemma simplifies the computation of the UCBs. Lemma 10. In all the UCBs introduced, the maximum over the interval is ...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.