pith. machine review for the scientific record. sign in

arxiv: 2604.14860 · v1 · submitted 2026-04-16 · 📊 stat.ML · cs.LG

Recognition: unknown

Best of both worlds: Stochastic & adversarial best-arm identification

Authors on Pith no claims yet

Pith reviewed 2026-05-10 10:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords best-arm identificationbandit algorithmsstochastic banditsadversarial banditsparameter-free algorithmslower boundsrobustness
0
0 comments X

The pith

A parameter-free algorithm matches the constrained stochastic lower bound up to log factors while remaining robust to adversarial rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether one can design a single learner that is optimal for both stochastic and adversarial best-arm identification without knowing which setting it faces. It proves this is impossible in general: adversarial robustness restricts optimal stochastic performance to only a subset of instances. A lower bound characterizes the best error rate achievable in that subset under the robustness requirement. The authors then give a simple parameter-free algorithm whose error probability matches this lower bound up to logarithmic factors on the allowable stochastic instances and stays robust when rewards are adversarial.

Core claim

Designing a learner that is optimal in both stochastic and adversarial best-arm identification is impossible in general. Robustness to adversarial rewards limits the learner to optimal rates on only a subset of stochastic problems, and a lower bound gives the optimal rate on that subset. A simple parameter-free algorithm achieves error probability matching this lower bound up to log factors in the stochastic case and remains robust against adversarial rewards.

What carries the argument

The simple parameter-free algorithm that achieves the constrained-optimal stochastic rate without knowledge of the reward model while guaranteeing adversarial robustness.

If this is right

  • Any adversarial-robust algorithm is provably suboptimal on some stochastic instances.
  • The lower bound exactly pins down the best stochastic performance possible under the robustness constraint.
  • The algorithm needs no parameters and works without knowing whether rewards are stochastic or adversarial.
  • The log-factor gap is the remaining slack between the algorithm and the lower bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar parameter-free constructions may apply to other bandit problems where the environment type is unknown in advance.
  • In applications such as online advertising, the algorithm could be deployed safely even if some reward sources are manipulated.
  • Tighter analysis or refined algorithms might remove the logarithmic factors while preserving the same guarantees.

Load-bearing premise

That the learner must remain robust to fully adversarial rewards while still achieving near-optimal rates on a non-trivial subset of stochastic instances.

What would settle it

An explicit stochastic instance in the allowable subset together with an algorithm that is adversarial-robust yet attains strictly better error probability than the derived lower bound (or an adversarial instance where the proposed algorithm's error exceeds the known adversarial optimum).

Figures

Figures reproduced from arXiv: 2604.14860 by Alan Malek, Michal Valko, Peter L. Bartlett, Victor Gabillon, Yasin Abbasi-Yadkori.

Figure 1
Figure 1. Figure 1: General protocol of the adversarial setting. The adversary is oblivious. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The ProbabilityONE (P1) algorithm 9 [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Probabilities of error of SR, P1, and the static uniform allocation. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
read the original abstract

We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript studies best-arm identification in bandits with rewards that may be stochastic or adversarial. It proves an impossibility result: no algorithm can achieve optimal stochastic rates on all instances while remaining robust to arbitrary adversarial rewards. A lower bound is derived that characterizes the optimal error rate achievable on a specific subset of stochastic problems under the adversarial-robustness constraint. The authors then present a simple parameter-free algorithm whose error probability matches this lower bound up to logarithmic factors on the identified stochastic subset and remains robust in the adversarial setting.

Significance. If the lower-bound derivation and matching analysis hold, the work provides a precise characterization of the achievable 'best-of-both-worlds' regime, identifying a non-trivial subset of stochastic instances where near-optimal rates remain possible under adversarial robustness. The parameter-free algorithm and explicit matching (up to logs) of the lower bound constitute concrete strengths that could guide future robust bandit designs. The result clarifies fundamental trade-offs without requiring the learner to know the reward model in advance.

major comments (2)
  1. [Lower bound theorem and surrounding analysis] The lower-bound derivation characterizing the stochastic subset under adversarial robustness (main impossibility and lower-bound result): the choice of hard-instance reward distributions and the precise reduction enforcing adversarial robustness are load-bearing; any looseness here directly determines whether the identified subset is non-trivial or overly narrow, and the information-theoretic argument requires explicit verification of tightness.
  2. [Algorithm and upper-bound analysis] Algorithm analysis (the parameter-free algorithm and its error-probability bound): the claim that the probability of error matches the lower bound up to log factors must specify the precise logarithmic overhead and confirm that the algorithm uses no hidden parameters or knowledge of the stochastic subset; the robustness proof for the adversarial case should be cross-referenced to the same construction.
minor comments (1)
  1. [Abstract] The abstract states that the algorithm 'matches (up to log factors) the lower bound' but does not indicate the exact log dependence or the precise subset characterization; adding one clarifying sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your careful reading and constructive feedback on our manuscript. We appreciate your recognition of the significance of characterizing the achievable regime under adversarial robustness. We address each major comment below.

read point-by-point responses
  1. Referee: [Lower bound theorem and surrounding analysis] The lower-bound derivation characterizing the stochastic subset under adversarial robustness (main impossibility and lower-bound result): the choice of hard-instance reward distributions and the precise reduction enforcing adversarial robustness are load-bearing; any looseness here directly determines whether the identified subset is non-trivial or overly narrow, and the information-theoretic argument requires explicit verification of tightness.

    Authors: The hard instances consist of Bernoulli arms whose means differ by a gap of order 1/sqrt(T) on a carefully chosen subset of instances where the best arm remains identifiable even under adversarial perturbations. The reduction proceeds by showing that any algorithm robust to arbitrary reward sequences must incur at least the same information cost as on these instances, because an adversary can always mimic the stochastic realization until the algorithm commits. This yields a non-trivial subset: all instances in which the sub-optimality gaps satisfy a mild separation condition that is independent of the algorithm. Tightness follows directly from the matching upper bound (up to logs) derived for the same subset in the subsequent section. To make the information-theoretic steps fully explicit, we will expand the proof of the lower bound with an additional appendix containing the full KL-divergence calculations and the precise reduction argument. revision: yes

  2. Referee: [Algorithm and upper-bound analysis] Algorithm analysis (the parameter-free algorithm and its error-probability bound): the claim that the probability of error matches the lower bound up to log factors must specify the precise logarithmic overhead and confirm that the algorithm uses no hidden parameters or knowledge of the stochastic subset; the robustness proof for the adversarial case should be cross-referenced to the same construction.

    Authors: The algorithm is a single parameter-free procedure that maintains a uniform sampling schedule augmented by a doubling-based elimination rule; it receives only K and delta and never uses any knowledge of the stochastic subset or instance parameters. Its error probability on the identified subset is at most C times the lower-bound expression multiplied by an explicit O(log(K/delta)) factor that arises from the union bound over elimination rounds and the doubling schedule. The adversarial robustness proof is obtained by observing that the same algorithm, when faced with arbitrary rewards, never eliminates arms faster than uniform sampling and therefore inherits the optimal adversarial rate; we will insert an explicit cross-reference from the stochastic analysis (Section 4) to the adversarial argument (Section 3) to clarify that both results apply to the identical construction. revision: partial

Circularity Check

0 steps flagged

No significant circularity; lower bound and algorithm are independently derived

full rationale

The paper first establishes an impossibility result showing that no single algorithm can be optimal on all stochastic instances while remaining fully adversarial-robust. It then derives a lower bound that characterizes the largest subset of stochastic problems on which near-optimal rates remain achievable under the robustness constraint. Finally, it presents a simple parameter-free algorithm whose error probability is shown to match this lower bound (up to log factors) on the identified subset while retaining adversarial robustness. None of these steps reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation whose content is merely restated. The lower-bound construction is information-theoretic and external to the algorithm design; the algorithm is explicitly parameter-free and its guarantees are proved directly rather than by construction from the bound itself. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard multi-armed bandit assumptions (finite arms, bounded rewards, fixed horizon or fixed-confidence setting) and the definition of adversarial robustness. No free parameters are introduced because the algorithm is parameter-free. No new entities are postulated.

axioms (1)
  • domain assumption Standard multi-armed bandit model with bounded rewards and best-arm identification objective
    Invoked throughout the abstract when discussing error rates in stochastic and adversarial settings.

pith-pipeline@v0.9.0 · 5466 in / 1186 out tokens · 65738 ms · 2026-05-10T10:12:54.585759+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 9 canonical work pages

  1. [1]

    Selection of learning experts https://ieeexplore.ieee.org/document/7965962/

    Robin Allesiardo and Rapha \"e l F \'e raud. Selection of learning experts https://ieeexplore.ieee.org/document/7965962/. In International Joint Conference on Neural Networks, 2017

  2. [2]

    The non-stationary stochastic multi-armed bandit problem https://doi.org/10.1007/s41060-017-0050-5

    Robin Allesiardo, Rapha \"e l F \'e raud, and Odalric-Ambrym Maillard. The non-stationary stochastic multi-armed bandit problem https://doi.org/10.1007/s41060-017-0050-5. International Journal of Data Science and Analytics, 2017

  3. [3]

    Best-arm identification for contaminated bandits https://arxiv.org/pdf/1802.09514.pdf

    Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best-arm identification for contaminated bandits https://arxiv.org/pdf/1802.09514.pdf. arXiv preprint arXiv:1802.09514 , 2018

  4. [4]

    Best-arm identification in multi-armed bandits http://www.learningtheory.org/colt2010/papers/59Audibert.pdf

    Jean-Yves Audibert, S \'e bastien Bubeck, and R \'e mi Munos. Best-arm identification in multi-armed bandits http://www.learningtheory.org/colt2010/papers/59Audibert.pdf. In Conference on Learning Theory (COLT), 2010

  5. [5]

    https://arxiv.org/pdf/1605.08722.pdf An empirical evaluation of Thompson sampling

    Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits https://arxiv.org/pdf/1605.08722.pdf. In Conference on Learning Theory (COLT) and arXiv preprint arXiv:1605.08722 , 2016

  6. [6]

    Schapire

    Peter Auer, Nicol \`o Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem http://rob.schapire.net/papers/AuerCeFrSc01.pdf. SIAM Journal on Computing, 32 0 (1), 2002

  7. [7]

    Multiple identifications in multi-armed bandits http://proceedings.mlr.press/v28/bubeck13.pdf

    S \'e bastian Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits http://proceedings.mlr.press/v28/bubeck13.pdf. In International Conference on Machine Learning (ICML), 2013

  8. [8]

    Regret analysis of stochastic and nonstochastic multi-armed bandit problems.arXiv preprint arXiv:1204.5721, 2012

    S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems https://arxiv.org/pdf/1204.5721.pdf. Foundations and Trends in Machine Learning, 5 0 (1), 2012

  9. [9]

    The best of both worlds: stochastic and adversarial bandits http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf

    S \'e bastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial bandits http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf. In Conference on Learning Theory (COLT), 2012

  10. [10]

    Pure exploration in multi-armed bandit problems https://doi.org/10.1007/978-3-642-04414-4_7

    S \'e bastien Bubeck, R \'e mi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandit problems https://doi.org/10.1007/978-3-642-04414-4_7. In Algorithmic Learning Theory (ALT), 2009

  11. [11]

    Tight (lower) bounds for the fixed budget best-arm identification bandit problem http://proceedings.mlr.press/v49/carpentier16.pdf

    Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best-arm identification bandit problem http://proceedings.mlr.press/v49/carpentier16.pdf. In Conference on Learning Theory (COLT), 2016

  12. [12]

    Extreme bandits https://papers.nips.cc/paper/5379-extreme-bandits.pdf

    Alexandra Carpentier and Michal Valko. Extreme bandits https://papers.nips.cc/paper/5379-extreme-bandits.pdf. In Neural Information Processing Systems (NeurIPS), 2014

  13. [13]

    Action elimination and stopping conditions for the multi-armed Bandit and reinforcement-learning problems http://jmlr.csail.mit.edu/papers/volume7/evendar06a/evendar06a.pdf

    Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed Bandit and reinforcement-learning problems http://jmlr.csail.mit.edu/papers/volume7/evendar06a/evendar06a.pdf. Journal of Machine Learning Research, 7: 0 1079--1105, 2006

  14. [14]

    Annals of Probability , volume = 2, pages =

    David A. Freedman. On tail probabilities for martingales https://projecteuclid.org/euclid.aop/1176996452. The Annals of Probability, pages 100--118, 1975

  15. [15]

    Optimal best-arm identification with fixed confidence http://proceedings.mlr.press/v49/garivier16a.pdf

    Aur \'e lien Garivier and Emilie Kaufmann. Optimal best-arm identification with fixed confidence http://proceedings.mlr.press/v49/garivier16a.pdf. In Conference on Learning Theory (COLT), 2016

  16. [16]

    Non-stochastic best-arm identification and hyperparameter optimization http://proceedings.mlr.press/v51/jamieson16.pdf

    Kevin Jamieson and Ameet Talwalkar. Non-stochastic best-arm identification and hyperparameter optimization http://proceedings.mlr.press/v51/jamieson16.pdf. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016

  17. [17]

    Rob Kaas and Jan M. Buhrman. Mean, median and mode in binomial distributions https://doi.org/10.1111/j.1467-9574.1980.tb00681.x. Statistica Neerlandica, 34 0 (1): 0 13--18, 1980

  18. [18]

    PAC subset selection in stochastic multi-armed bandits https://www.cse.iitb.ac.in/ shivaram/papers/ktas_icml_2012.pdf

    Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits https://www.cse.iitb.ac.in/ shivaram/papers/ktas_icml_2012.pdf. In International Conference on Machine Learning (ICML), 2012

  19. [19]

    Almost optimal exploration in multi-armed bandits http://proceedings.mlr.press/v28/karnin13.pdf

    Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits http://proceedings.mlr.press/v28/karnin13.pdf. In International Conference on Machine Learning (ICML), 2013

  20. [20]

    Information complexity in bandit subset selection http://proceedings.mlr.press/v30/Kaufmann13.pdf

    Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection http://proceedings.mlr.press/v30/Kaufmann13.pdf. In Conference on Learning Theory (COLT), 2013

  21. [21]

    Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

    Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. https://arxiv.org/pdf/1603.06560.pdf Hyperband : A novel bandit-based approach to hyperparameter optimization . arXiv preprint arXiv:1603.06560 , 2016

  22. [22]

    An optimal algorithm for the thresholding bandit problem http://proceedings.mlr.press/v48/locatelli16-supp.pdf

    Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem http://proceedings.mlr.press/v48/locatelli16-supp.pdf. In International Conference on Machine Learning (ICML), 2016

  23. [23]

    Tsitsiklis

    Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf. Journal of Machine Learning Research, 5 0 (Jun), 2004

  24. [24]

    Oded Maron and Andrew Moore. Hoeffding Races: Accelerating model-selection search for classification and function approximation https://papers.nips.cc/paper/841-hoeffding-races-accelerating-model-selection-search-for-classification-and-function-approximation.pdf. In Neural Information Processing Systems (NeurIPS), 1993

  25. [25]

    http://icml2008.cs.helsinki.fi/papers/523.pdf Empirical B ernstein stopping

    Volodymyr Mnih, Csaba Szepesv\' a ri, and Jean-Yves Audibert. http://icml2008.cs.helsinki.fi/papers/523.pdf Empirical B ernstein stopping . In International Conference on Machine Learning (ICML), 2008

  26. [26]

    Applications and explanations of Zipf's law http://aclweb.org/anthology/W98-1218

    David Powers. Applications and explanations of Zipf's law http://aclweb.org/anthology/W98-1218. In New methods in language processing and computational natural language learning. Association for Computational Linguistics, 1998

  27. [27]

    http://proceedings.mlr.press/v65/seldin17a/seldin17a.pdf An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits

    Yevgeny Seldin and G \' a bor Lugosi. http://proceedings.mlr.press/v65/seldin17a/seldin17a.pdf An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits . In Conference on Learning Theory (COLT), 2017

  28. [28]

    One practical algorithm for both stochastic and adversarial bandits http://proceedings.mlr.press/v32/seldinb14.pdf

    Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits http://proceedings.mlr.press/v32/seldinb14.pdf. In International Conference on Machine Learning (ICML), 2014