pith. sign in

arxiv: 1906.10431 · v1 · pith:XGPCXG2Nnew · submitted 2019-06-25 · 📊 stat.ML · cs.LG

Non-Asymptotic Pure Exploration by Solving Games

Pith reviewed 2026-05-25 16:14 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords pure explorationmulti-armed banditsno-regret learningexponential familysaddle pointfinite-sample guaranteesactive testing
0
0 comments X

The pith

Pure exploration in exponential-family bandits obtains its first finite-sample guarantees by treating the lower-bound optimization as a game and converging to its saddle point with no-regret learners that need only a best-response oracle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes the optimization problem that sets the information-theoretic sample complexity of pure exploration as an unknown two-player game. Running no-regret strategies on this game produces iterates that converge to the saddle point, yielding stopping rules whose total samples stay within a constant factor of the lower bound with high probability. Because the method only queries a best-response oracle rather than solving the full convex program at every round, it applies to any pure-exploration query and any bandit structure whose arms belong to an exponential family. A reader would care because the approach closes the long-standing gap between asymptotic optimality and concrete, non-asymptotic performance guarantees.

Core claim

The optimization problem that lower-bounds the sample complexity of pure exploration in exponential-family bandits admits a game-theoretic formulation whose value is attained at a saddle point. Iterative no-regret learning strategies that only require a best-response oracle converge to this saddle point and thereby deliver the first finite-confidence guarantees that are adapted to the exponential family and that apply to arbitrary pure-exploration queries and bandit structures.

What carries the argument

The game whose payoff is the lower-bound optimization problem; no-regret learners play against each other using only best-response oracles to reach its saddle point.

If this is right

  • Any pure-exploration query receives high-probability sample-complexity bounds that match the lower bound up to lower-order terms.
  • The same sampling rule works for every exponential-family bandit model and every query without requiring a full optimizer.
  • Only a best-response oracle is needed, so the method extends immediately to structures where the full optimization problem is intractable.
  • Finite-sample guarantees hold uniformly over all queries rather than only in the asymptotic regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The game view could be applied to other sequential problems whose lower bounds are convex programs, such as best-policy identification in MDPs.
  • Faster convergence rates might be obtained by swapping the generic no-regret learners for methods that exploit the specific geometry of the payoff matrix.
  • If the best-response oracle itself must be approximated, the resulting error can be absorbed into the same finite-sample analysis.

Load-bearing premise

The lower-bound optimization problem must possess a saddle point that iterative best-response play can reach.

What would settle it

A simple two-arm Gaussian bandit with a fixed-confidence best-arm identification query in which the algorithm's stopping time exceeds twice the information-theoretic lower bound on more than 5 percent of trials.

Figures

Figures reproduced from arXiv: 1906.10431 by Pierre M\'enard, R\'emy Degenne, Wouter M. Koolen.

Figure 1
Figure 1. Figure 1: Selected experiments. In both cases δ = 0.1. Plots based on 3000 and 5000 runs. This algorithm is the most computationally expensive but has the best sample complexity upper bound, has a simpler proof and works well in experiments where computing the max-max-min oracle is feasible, like the Best Arm and Minimum Threshold problems (see section 4). 4 Experiments The goal of our experiments is to empirically … view at source ↗
Figure 2
Figure 2. Figure 2: Best Arm experiments from [13]. In both cases [PITH_FULL_IMAGE:figures/full_fig_p027_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Best Arm experiment from [24]. Gaussian bandit µ = (1., 0.85, 0.8, 0.7), w∗ = (0.41, 0.38, 0.15, 0.06). Plots show 3000 runs. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Minimum Threshold experiments from [20] with threshold [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Minimum Threshold experiment (new): Gaussian bandit [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗
read the original abstract

Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to recast the lower-bound optimization problem for pure exploration in exponential-family multi-armed bandits as an unknown zero-sum game possessing a saddle point. By applying no-regret learners that only require a best-response oracle (rather than solving the full optimization at each step), the authors obtain the first finite-confidence, non-asymptotic guarantees that are adapted to the exponential family and apply to arbitrary pure-exploration queries and bandit structures.

Significance. If the central claims hold, the contribution is significant: it advances pure-exploration theory beyond asymptotic optimality to explicit finite-time bounds while remaining computationally lighter through oracle access. The game-theoretic reduction leverages standard no-regret convergence properties and appears to preserve the exponential-family structure in the analysis, offering a general template that could apply across many query types.

major comments (2)
  1. [game formulation and minimax argument] The central claim that no-regret dynamics converge to the saddle point of the lower-bound game (thereby yielding finite-confidence sample-complexity bounds) is load-bearing; the manuscript must explicitly verify that the payoff function satisfies the conditions for existence of a saddle point under the exponential-family parameterization and that the best-response oracle is well-defined for the considered structures.
  2. [finite-confidence analysis] The finite-time analysis must show how the regret bounds of the no-regret learners translate into an explicit (non-asymptotic) upper bound on the number of samples needed to answer the query with a given; any hidden dependence on unknown parameters would undermine the non-asymptotic guarantee.
minor comments (2)
  1. [abstract and introduction] The abstract asserts applicability to 'any pure exploration query and bandit structure'; the introduction or related-work section should delineate the precise class of queries and structures for which the best-response oracle is assumed to exist.
  2. [notation] Notation for the game payoff and the best-response oracle should be introduced with a short table or explicit definitions early in the technical sections to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The comments identify areas where explicitness can be improved. We address each major comment below.

read point-by-point responses
  1. Referee: [game formulation and minimax argument] The central claim that no-regret dynamics converge to the saddle point of the lower-bound game (thereby yielding finite-confidence sample-complexity bounds) is load-bearing; the manuscript must explicitly verify that the payoff function satisfies the conditions for existence of a saddle point under the exponential-family parameterization and that the best-response oracle is well-defined for the considered structures.

    Authors: We agree that explicit verification strengthens the central argument. The payoff function is the expected log-likelihood ratio, which is bilinear (hence convex-concave) over the product of simplices for exponential-family models; Sion's minimax theorem therefore guarantees a saddle point. The best-response oracle reduces to a one-dimensional convex optimization per arm (solving for the natural parameter matching a prescribed mean), which is well-defined and efficiently computable for any exponential family. We will add a dedicated paragraph in Section 2.2 stating these facts and citing the relevant minimax theorem. revision: yes

  2. Referee: [finite-confidence analysis] The finite-time analysis must show how the regret bounds of the no-regret learners translate into an explicit (non-asymptotic) upper bound on the number of samples needed to answer the query with a given; any hidden dependence on unknown parameters would undermine the non-asymptotic guarantee.

    Authors: Theorem 4 already derives the translation: the duality gap after T steps is bounded by the sum of the two players' regrets divided by T; the algorithm stops when the estimated value exceeds the threshold by this gap plus a concentration term. Inverting the resulting inequality produces an explicit non-asymptotic bound T = O((V* + R(T))/delta), where V* is the game value (the information-theoretic lower bound) and R(T) is a known function such as O(sqrt(T log K)) for Hedge. No unknown parameters appear outside V* itself, which is intrinsic to the problem. We will add a corollary that isolates this explicit sample-complexity expression. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper recasts the standard lower-bound optimization problem for pure exploration in exponential families as a zero-sum game with a saddle point, then applies established no-regret learner convergence properties together with a best-response oracle. All load-bearing steps (game formulation, saddle-point existence, oracle-based iteration) are developed within the manuscript using known minimax and regret bounds rather than reducing to self-definition, fitted inputs renamed as predictions, or self-citation chains. The finite-confidence guarantees follow directly from these constructions without circular reduction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the approach appears to rest on standard game-theoretic and no-regret learning assumptions from prior literature.

pith-pipeline@v0.9.0 · 5672 in / 1105 out tokens · 31317 ms · 2026-05-25T16:14:37.790460+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

  1. [1]

    P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002

  2. [2]

    Fast rates for bandit optimization with upper-confidence Frank-Wolfe

    Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems (NeurIPS), pages 2222– 2231, 2017

  3. [3]

    Bubeck, R

    S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Finitely Armed and Continuous Armed Bandits. Theoretical Computer Science 412, 1832-1852, 412:1832–1852, 2011

  4. [4]

    Prediction, learning, and games

    Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006

  5. [5]

    Towards instance optimal bounds for best arm identi- fication

    Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identi- fication. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 535–592, Amsterdam, Netherlands, July 2017. PMLR

  6. [6]

    S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial Pure Exploration of Multi-Armed Bandits. In Advances in Neural Information Processing Systems, 2014

  7. [7]

    Unimodal bandits: Regret lower bounds and optimal algorithms

    Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521–529, 2014

  8. [8]

    Grünwald, and Wouter M

    Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, April 2014

  9. [9]

    Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. ArXiv, February 2019

  10. [10]

    PAC bounds for multi-armed bandit and markov decision processes

    Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and markov decision processes. In 15th Annual Conference on Learning Theory (COLT), volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002

  11. [11]

    Adaptive game playing using multiplicative weights

    Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999

  12. [12]

    Informational confidence bounds for self-normalized averages and applica- tions

    Aurélien Garivier. Informational confidence bounds for self-normalized averages and applica- tions. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013

  13. [13]

    Optimal best arm identification with fixed confidence

    Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016. 9

  14. [14]

    Thresholding Bandit for Dose-ranging: The Impact of Monotonicity

    Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017

  15. [15]

    Kalyanakrishnan and P

    S. Kalyanakrishnan and P. Stone. Efficient Selection in Multiple Bandit Arms: Theory and Practice. In International Conference on Machine Learning (ICML), 2010

  16. [16]

    Kalyanakrishnan, A

    S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012

  17. [17]

    Stochas- tic rank-1 bandits

    Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochas- tic rank-1 bandits. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning R...

  18. [18]

    Kaufmann, O

    E. Kaufmann, O. Cappé, and A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1–42, 2016

  19. [19]

    Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, October 2018

  20. [20]

    Koolen, and Aurélien Garivier

    Emilie Kaufmann, Wouter M. Koolen, and Aurélien Garivier. Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 6333–6343, December 2018

  21. [21]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019

  22. [22]

    An optimal algorithm for the thresholding bandit problem

    Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, ...

  23. [23]

    Lipschitz bandits: Regret lower bound and optimal algorithms

    Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999, 2014

  24. [24]

    Gradient Ascent for Active Exploration in Bandit Problems

    Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019

  25. [25]

    Optimization, learning, and games with predictable sequences

    Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NeurIPS), pages 3066–3074, 2013

  26. [26]

    The simulator: Understanding adaptive sampling in the moderate-confidence regime

    Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794– 1834, 2017

  27. [27]

    Efficient sampling method for Monte Carlo tree search problem

    Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014

  28. [28]

    Abernethy

    Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pages 3828–3838, 2018

  29. [29]

    Identify the Nash equilibrium in static games with random payoffs

    Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4160–4169, International Convention Centre, Sydney, Australia, August 2017. PML...

  30. [30]

    U k t (1) = max { f(t−1) Nk t−1 , maxξ∈[αk t ,βk t ] Eλ∼qt d(ξ, λk) }

  31. [31]

    U k t (2) = max { f(t−1) Nk t−1 , maxξ∈[ak t ,bk t ] Eλ∼qt d(ξ, λk) }

  32. [32]

    U k t (1) (λ) = max { f(t−1) Nk t−1 , maxξ∈[αk t ,βk t ] d(ξ, λk) }

  33. [33]

    optimistic

    U k t (2) (λ) = max { f(t−1) Nk t−1 , maxξ∈[ak t ,bk t ] d(ξ, λk) } . The UCBs indexed by (2) are larger but potentially easier to compute that the ones indexed by (1), since ak t and bk t are easier to compute than αk t and βk t . The next lemma simplifies the computation of the UCBs. Lemma 10. In all the UCBs introduced, the maximum over the interval is ...