arxiv: 2604.14860 · v1 · submitted 2026-04-16 · 📊 stat.ML · cs.LG

Recognition: unknown

Best of both worlds: Stochastic & adversarial best-arm identification

Yasin Abbasi-Yadkori , Peter L. Bartlett , Victor Gabillon , Alan Malek , Michal Valko

Authors on Pith no claims yet

Pith reviewed 2026-05-10 10:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords best-arm identificationbandit algorithmsstochastic banditsadversarial banditsparameter-free algorithmslower boundsrobustness

0 comments

The pith

A parameter-free algorithm matches the constrained stochastic lower bound up to log factors while remaining robust to adversarial rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether one can design a single learner that is optimal for both stochastic and adversarial best-arm identification without knowing which setting it faces. It proves this is impossible in general: adversarial robustness restricts optimal stochastic performance to only a subset of instances. A lower bound characterizes the best error rate achievable in that subset under the robustness requirement. The authors then give a simple parameter-free algorithm whose error probability matches this lower bound up to logarithmic factors on the allowable stochastic instances and stays robust when rewards are adversarial.

Core claim

Designing a learner that is optimal in both stochastic and adversarial best-arm identification is impossible in general. Robustness to adversarial rewards limits the learner to optimal rates on only a subset of stochastic problems, and a lower bound gives the optimal rate on that subset. A simple parameter-free algorithm achieves error probability matching this lower bound up to log factors in the stochastic case and remains robust against adversarial rewards.

What carries the argument

The simple parameter-free algorithm that achieves the constrained-optimal stochastic rate without knowledge of the reward model while guaranteeing adversarial robustness.

If this is right

Any adversarial-robust algorithm is provably suboptimal on some stochastic instances.
The lower bound exactly pins down the best stochastic performance possible under the robustness constraint.
The algorithm needs no parameters and works without knowing whether rewards are stochastic or adversarial.
The log-factor gap is the remaining slack between the algorithm and the lower bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar parameter-free constructions may apply to other bandit problems where the environment type is unknown in advance.
In applications such as online advertising, the algorithm could be deployed safely even if some reward sources are manipulated.
Tighter analysis or refined algorithms might remove the logarithmic factors while preserving the same guarantees.

Load-bearing premise

That the learner must remain robust to fully adversarial rewards while still achieving near-optimal rates on a non-trivial subset of stochastic instances.

What would settle it

An explicit stochastic instance in the allowable subset together with an algorithm that is adversarial-robust yet attains strictly better error probability than the derived lower bound (or an adversarial instance where the proposed algorithm's error exceeds the known adversarial optimum).

Figures

Figures reproduced from arXiv: 2604.14860 by Alan Malek, Michal Valko, Peter L. Bartlett, Victor Gabillon, Yasin Abbasi-Yadkori.

**Figure 2.** Figure 2: The ProbabilityONE (P1) algorithm 9 [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Probabilities of error of SR, P1, and the static uniform allocation. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

read the original abstract

We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

No algorithm can hit optimal stochastic rates on all instances while staying adversarial-robust, but this paper pins down the allowable subset, gives a matching lower bound, and supplies a simple parameter-free algorithm that achieves it up to logs.

read the letter

The key takeaway is that you cannot design a best-arm identification algorithm that achieves the optimal error rate on every stochastic instance and remains robust to fully adversarial rewards. The paper proves this impossibility and then identifies the subset of stochastic problems where near-optimal performance is still possible under the robustness requirement. They provide a matching lower bound and a simple parameter-free algorithm that attains rates within logarithmic factors of this bound in the stochastic case while handling adversarial rewards. The work does a good job of making the algorithm straightforward and free of any parameters that would need tuning based on the setting. This addresses a real practical gap where the reward mechanism is unknown in advance. The lower bound seems carefully constructed to capture the tension between the two models. The main soft spot is the characterization of the stochastic subset itself. It is defined through the lower bound, so its breadth depends on how the hard instances are chosen in the proof. If that subset turns out to be limited to problems with particular gap structures or reward distributions, the result applies less broadly than the title suggests. There is also no empirical evaluation in the abstract, leaving open questions about finite-sample behavior and comparisons to existing methods. The full proofs would need checking to confirm the log factors are not hiding larger constants. This paper is for bandit researchers interested in robust algorithms under model uncertainty. A reader focused on theoretical trade-offs in best-arm identification will get the most out of it. It deserves serious peer review because the impossibility result and the constructive algorithm are non-trivial contributions that clarify the limits of simultaneous optimality.

Referee Report

2 major / 1 minor

Summary. The manuscript studies best-arm identification in bandits with rewards that may be stochastic or adversarial. It proves an impossibility result: no algorithm can achieve optimal stochastic rates on all instances while remaining robust to arbitrary adversarial rewards. A lower bound is derived that characterizes the optimal error rate achievable on a specific subset of stochastic problems under the adversarial-robustness constraint. The authors then present a simple parameter-free algorithm whose error probability matches this lower bound up to logarithmic factors on the identified stochastic subset and remains robust in the adversarial setting.

Significance. If the lower-bound derivation and matching analysis hold, the work provides a precise characterization of the achievable 'best-of-both-worlds' regime, identifying a non-trivial subset of stochastic instances where near-optimal rates remain possible under adversarial robustness. The parameter-free algorithm and explicit matching (up to logs) of the lower bound constitute concrete strengths that could guide future robust bandit designs. The result clarifies fundamental trade-offs without requiring the learner to know the reward model in advance.

major comments (2)

[Lower bound theorem and surrounding analysis] The lower-bound derivation characterizing the stochastic subset under adversarial robustness (main impossibility and lower-bound result): the choice of hard-instance reward distributions and the precise reduction enforcing adversarial robustness are load-bearing; any looseness here directly determines whether the identified subset is non-trivial or overly narrow, and the information-theoretic argument requires explicit verification of tightness.
[Algorithm and upper-bound analysis] Algorithm analysis (the parameter-free algorithm and its error-probability bound): the claim that the probability of error matches the lower bound up to log factors must specify the precise logarithmic overhead and confirm that the algorithm uses no hidden parameters or knowledge of the stochastic subset; the robustness proof for the adversarial case should be cross-referenced to the same construction.

minor comments (1)

[Abstract] The abstract states that the algorithm 'matches (up to log factors) the lower bound' but does not indicate the exact log dependence or the precise subset characterization; adding one clarifying sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your careful reading and constructive feedback on our manuscript. We appreciate your recognition of the significance of characterizing the achievable regime under adversarial robustness. We address each major comment below.

read point-by-point responses

Referee: [Lower bound theorem and surrounding analysis] The lower-bound derivation characterizing the stochastic subset under adversarial robustness (main impossibility and lower-bound result): the choice of hard-instance reward distributions and the precise reduction enforcing adversarial robustness are load-bearing; any looseness here directly determines whether the identified subset is non-trivial or overly narrow, and the information-theoretic argument requires explicit verification of tightness.

Authors: The hard instances consist of Bernoulli arms whose means differ by a gap of order 1/sqrt(T) on a carefully chosen subset of instances where the best arm remains identifiable even under adversarial perturbations. The reduction proceeds by showing that any algorithm robust to arbitrary reward sequences must incur at least the same information cost as on these instances, because an adversary can always mimic the stochastic realization until the algorithm commits. This yields a non-trivial subset: all instances in which the sub-optimality gaps satisfy a mild separation condition that is independent of the algorithm. Tightness follows directly from the matching upper bound (up to logs) derived for the same subset in the subsequent section. To make the information-theoretic steps fully explicit, we will expand the proof of the lower bound with an additional appendix containing the full KL-divergence calculations and the precise reduction argument. revision: yes
Referee: [Algorithm and upper-bound analysis] Algorithm analysis (the parameter-free algorithm and its error-probability bound): the claim that the probability of error matches the lower bound up to log factors must specify the precise logarithmic overhead and confirm that the algorithm uses no hidden parameters or knowledge of the stochastic subset; the robustness proof for the adversarial case should be cross-referenced to the same construction.

Authors: The algorithm is a single parameter-free procedure that maintains a uniform sampling schedule augmented by a doubling-based elimination rule; it receives only K and delta and never uses any knowledge of the stochastic subset or instance parameters. Its error probability on the identified subset is at most C times the lower-bound expression multiplied by an explicit O(log(K/delta)) factor that arises from the union bound over elimination rounds and the doubling schedule. The adversarial robustness proof is obtained by observing that the same algorithm, when faced with arbitrary rewards, never eliminates arms faster than uniform sampling and therefore inherits the optimal adversarial rate; we will insert an explicit cross-reference from the stochastic analysis (Section 4) to the adversarial argument (Section 3) to clarify that both results apply to the identical construction. revision: partial

Circularity Check

0 steps flagged

No significant circularity; lower bound and algorithm are independently derived

full rationale

The paper first establishes an impossibility result showing that no single algorithm can be optimal on all stochastic instances while remaining fully adversarial-robust. It then derives a lower bound that characterizes the largest subset of stochastic problems on which near-optimal rates remain achievable under the robustness constraint. Finally, it presents a simple parameter-free algorithm whose error probability is shown to match this lower bound (up to log factors) on the identified subset while retaining adversarial robustness. None of these steps reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation whose content is merely restated. The lower-bound construction is information-theoretic and external to the algorithm design; the algorithm is explicitly parameter-free and its guarantees are proved directly rather than by construction from the bound itself. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard multi-armed bandit assumptions (finite arms, bounded rewards, fixed horizon or fixed-confidence setting) and the definition of adversarial robustness. No free parameters are introduced because the algorithm is parameter-free. No new entities are postulated.

axioms (1)

domain assumption Standard multi-armed bandit model with bounded rewards and best-arm identification objective
Invoked throughout the abstract when discussing error rates in stochastic and adversarial settings.

pith-pipeline@v0.9.0 · 5466 in / 1186 out tokens · 65738 ms · 2026-05-10T10:12:54.585759+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 9 canonical work pages

[1]

Selection of learning experts https://ieeexplore.ieee.org/document/7965962/

Robin Allesiardo and Rapha \"e l F \'e raud. Selection of learning experts https://ieeexplore.ieee.org/document/7965962/. In International Joint Conference on Neural Networks, 2017

work page arXiv 2017
[2]

The non-stationary stochastic multi-armed bandit problem https://doi.org/10.1007/s41060-017-0050-5

Robin Allesiardo, Rapha \"e l F \'e raud, and Odalric-Ambrym Maillard. The non-stationary stochastic multi-armed bandit problem https://doi.org/10.1007/s41060-017-0050-5. International Journal of Data Science and Analytics, 2017

work page doi:10.1007/s41060-017-0050-5 2017
[3]

Best-arm identification for contaminated bandits https://arxiv.org/pdf/1802.09514.pdf

Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best-arm identification for contaminated bandits https://arxiv.org/pdf/1802.09514.pdf. arXiv preprint arXiv:1802.09514 , 2018

work page arXiv 2018
[4]

Best-arm identification in multi-armed bandits http://www.learningtheory.org/colt2010/papers/59Audibert.pdf

Jean-Yves Audibert, S \'e bastien Bubeck, and R \'e mi Munos. Best-arm identification in multi-armed bandits http://www.learningtheory.org/colt2010/papers/59Audibert.pdf. In Conference on Learning Theory (COLT), 2010

2010
[5]

https://arxiv.org/pdf/1605.08722.pdf An empirical evaluation of Thompson sampling

Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits https://arxiv.org/pdf/1605.08722.pdf. In Conference on Learning Theory (COLT) and arXiv preprint arXiv:1605.08722 , 2016

work page arXiv 2016
[6]

Schapire

Peter Auer, Nicol \`o Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem http://rob.schapire.net/papers/AuerCeFrSc01.pdf. SIAM Journal on Computing, 32 0 (1), 2002

2002
[7]

Multiple identifications in multi-armed bandits http://proceedings.mlr.press/v28/bubeck13.pdf

S \'e bastian Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits http://proceedings.mlr.press/v28/bubeck13.pdf. In International Conference on Machine Learning (ICML), 2013

2013
[8]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems.arXiv preprint arXiv:1204.5721, 2012

S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems https://arxiv.org/pdf/1204.5721.pdf. Foundations and Trends in Machine Learning, 5 0 (1), 2012

work page arXiv 2012
[9]

The best of both worlds: stochastic and adversarial bandits http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf

S \'e bastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial bandits http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf. In Conference on Learning Theory (COLT), 2012

2012
[10]

Pure exploration in multi-armed bandit problems https://doi.org/10.1007/978-3-642-04414-4_7

S \'e bastien Bubeck, R \'e mi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandit problems https://doi.org/10.1007/978-3-642-04414-4_7. In Algorithmic Learning Theory (ALT), 2009

work page doi:10.1007/978-3-642-04414-4_7 2009
[11]

Tight (lower) bounds for the fixed budget best-arm identification bandit problem http://proceedings.mlr.press/v49/carpentier16.pdf

Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best-arm identification bandit problem http://proceedings.mlr.press/v49/carpentier16.pdf. In Conference on Learning Theory (COLT), 2016

2016
[12]

Extreme bandits https://papers.nips.cc/paper/5379-extreme-bandits.pdf

Alexandra Carpentier and Michal Valko. Extreme bandits https://papers.nips.cc/paper/5379-extreme-bandits.pdf. In Neural Information Processing Systems (NeurIPS), 2014

2014
[13]

Action elimination and stopping conditions for the multi-armed Bandit and reinforcement-learning problems http://jmlr.csail.mit.edu/papers/volume7/evendar06a/evendar06a.pdf

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed Bandit and reinforcement-learning problems http://jmlr.csail.mit.edu/papers/volume7/evendar06a/evendar06a.pdf. Journal of Machine Learning Research, 7: 0 1079--1105, 2006

2006
[14]

Annals of Probability , volume = 2, pages =

David A. Freedman. On tail probabilities for martingales https://projecteuclid.org/euclid.aop/1176996452. The Annals of Probability, pages 100--118, 1975

work page arXiv 1975
[15]

Optimal best-arm identification with fixed confidence http://proceedings.mlr.press/v49/garivier16a.pdf

Aur \'e lien Garivier and Emilie Kaufmann. Optimal best-arm identification with fixed confidence http://proceedings.mlr.press/v49/garivier16a.pdf. In Conference on Learning Theory (COLT), 2016

2016
[16]

Non-stochastic best-arm identification and hyperparameter optimization http://proceedings.mlr.press/v51/jamieson16.pdf

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best-arm identification and hyperparameter optimization http://proceedings.mlr.press/v51/jamieson16.pdf. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016

2016
[17]

Rob Kaas and Jan M. Buhrman. Mean, median and mode in binomial distributions https://doi.org/10.1111/j.1467-9574.1980.tb00681.x. Statistica Neerlandica, 34 0 (1): 0 13--18, 1980

work page doi:10.1111/j.1467-9574.1980.tb00681.x 1980
[18]

PAC subset selection in stochastic multi-armed bandits https://www.cse.iitb.ac.in/ shivaram/papers/ktas_icml_2012.pdf

Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits https://www.cse.iitb.ac.in/ shivaram/papers/ktas_icml_2012.pdf. In International Conference on Machine Learning (ICML), 2012

2012
[19]

Almost optimal exploration in multi-armed bandits http://proceedings.mlr.press/v28/karnin13.pdf

Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits http://proceedings.mlr.press/v28/karnin13.pdf. In International Conference on Machine Learning (ICML), 2013

2013
[20]

Information complexity in bandit subset selection http://proceedings.mlr.press/v30/Kaufmann13.pdf

Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection http://proceedings.mlr.press/v30/Kaufmann13.pdf. In Conference on Learning Theory (COLT), 2013

2013
[21]

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. https://arxiv.org/pdf/1603.06560.pdf Hyperband : A novel bandit-based approach to hyperparameter optimization . arXiv preprint arXiv:1603.06560 , 2016

work page Pith review arXiv 2016
[22]

An optimal algorithm for the thresholding bandit problem http://proceedings.mlr.press/v48/locatelli16-supp.pdf

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem http://proceedings.mlr.press/v48/locatelli16-supp.pdf. In International Conference on Machine Learning (ICML), 2016

2016
[23]

Tsitsiklis

Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf. Journal of Machine Learning Research, 5 0 (Jun), 2004

2004
[24]

Oded Maron and Andrew Moore. Hoeffding Races: Accelerating model-selection search for classification and function approximation https://papers.nips.cc/paper/841-hoeffding-races-accelerating-model-selection-search-for-classification-and-function-approximation.pdf. In Neural Information Processing Systems (NeurIPS), 1993

1993
[25]

http://icml2008.cs.helsinki.fi/papers/523.pdf Empirical B ernstein stopping

Volodymyr Mnih, Csaba Szepesv\' a ri, and Jean-Yves Audibert. http://icml2008.cs.helsinki.fi/papers/523.pdf Empirical B ernstein stopping . In International Conference on Machine Learning (ICML), 2008

2008
[26]

Applications and explanations of Zipf's law http://aclweb.org/anthology/W98-1218

David Powers. Applications and explanations of Zipf's law http://aclweb.org/anthology/W98-1218. In New methods in language processing and computational natural language learning. Association for Computational Linguistics, 1998

1998
[27]

http://proceedings.mlr.press/v65/seldin17a/seldin17a.pdf An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits

Yevgeny Seldin and G \' a bor Lugosi. http://proceedings.mlr.press/v65/seldin17a/seldin17a.pdf An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits . In Conference on Learning Theory (COLT), 2017

2017
[28]

One practical algorithm for both stochastic and adversarial bandits http://proceedings.mlr.press/v32/seldinb14.pdf

Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits http://proceedings.mlr.press/v32/seldinb14.pdf. In International Conference on Machine Learning (ICML), 2014

2014