Leveraging Similarities in Multi-Armed Bandits
Pith reviewed 2026-06-26 08:39 UTC · model grok-4.3
The pith
Tree-structured action similarities allow bandit algorithms to replace the full action count K by an effective count K_eff in regret bounds, but only when feedback is richer than one-point.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For loss sequences that are tree-compatible, meaning losses of actions whose lowest common ancestor is at a given tree level differ by at most a factor tied to that level, algorithms exist that achieve regret scaling with an effective number of actions K_eff induced by the tree rather than the total number of leaves K, provided the feedback model supplies at least two loss evaluations per round; the same algorithms are impossible to obtain under ordinary one-point feedback.
What carries the argument
A rooted tree on the action set whose internal nodes define similarity levels, together with the derived effective action count K_eff that counts how many distinct loss values are distinguishable under the tree compatibility constraint.
If this is right
- Regret bounds replace the raw action count K by the smaller effective count K_eff across semi-bandit, multi-point, and two-point feedback models.
- The algorithms simultaneously achieve the best known rates for both stochastic and adversarial environments.
- Square-root regret becomes attainable for Lipschitz bandits on the unit ball when dimension is at most two and two-point feedback is supplied.
- The same framework covers hierarchical action sets and tagged or latent-trait similarities encoded as trees.
Where Pith is reading between the lines
- Two-point feedback may be sufficient to obtain near-optimal rates in other structured continuous problems whose similarity can be approximated by a tree.
- The impossibility for one-point feedback suggests that any practical deployment of similarity-aware bandits will require at least occasional direct loss comparisons between nearby actions.
- K_eff could serve as a design criterion for constructing or pruning action trees in applications where the number of leaves is large but the effective diversity is small.
Load-bearing premise
The loss sequence must be tree-compatible so that losses of similar actions stay close.
What would settle it
An explicit loss sequence that is tree-compatible yet forces any algorithm using only one-point feedback to suffer regret linear in the full number of actions K rather than in K_eff.
read the original abstract
In many online learning and bandit problems, the actions we consider possess inherent similarities--for instance because they share latent traits, tags, or hierarchical structure. We study online learning with a similarity-structured action set, encoded by a rooted tree whose leaves are the actions and whose levels quantify how closely two actions are related. The loss sequence is assumed tree-compatible: losses of similar actions are constrained to be close. We establish an impossibility result showing that usual one-point bandit feedback cannot, in general, leverage range or tree-induced similarity, even under very strong similarity constraints. We then provide a unified set of algorithms which adapt to a wide range of richer feedback models, from semi-bandit feedback down to multi-point bandit protocols, including the minimal two-point feedback setting. We show these algorithms exhibit best-of-both-worlds guarantees and provably exploit action similarities by replacing the number of actions $K$ by a similarity-aware effective number of actions $K_{\mathrm{eff}}$ in the regret bounds. As an application, we show that under two-point feedback, it is possible to achieve $\sqrt{T}$ regret in Lipschitz bandits when $d \leq 2$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies multi-armed bandits where actions form a similarity structure encoded by a rooted tree (leaves are actions, levels quantify relatedness). Losses are assumed tree-compatible (similar actions have close losses). It proves an impossibility result that standard one-point bandit feedback cannot leverage such similarities even under strong constraints. It then gives a unified family of algorithms for richer feedback models (semi-bandit down to minimal two-point feedback) that achieve best-of-both-worlds regret bounds by replacing the action count K with a similarity-aware effective count K_eff. As an application, two-point feedback yields √T regret for Lipschitz bandits when d ≤ 2.
Significance. If the proofs and definitions of K_eff hold, the work supplies a clean way to quantify and exploit tree-induced similarities across feedback regimes, with the two-point Lipschitz application providing a concrete improvement over standard one-point bounds in low dimensions. The impossibility result for one-point feedback and the best-of-both-worlds guarantees are useful complementary contributions.
minor comments (2)
- The abstract introduces K_eff as the quantity that replaces K in the regret bounds but does not indicate how K_eff is computed from the tree or the loss-compatibility assumption; a brief parenthetical definition or reference to the main theorem would improve readability.
- The tree-compatibility assumption on the loss sequence is central to both the impossibility and the positive results; a short illustrative example of a tree-compatible vs. incompatible loss sequence would help readers assess the modeling restriction.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting the potential significance of the impossibility result, the unified algorithms, and the Lipschitz application. We address the points raised below. Since the report lists no enumerated major comments after 'MAJOR COMMENTS:', we respond to the overall assessment and note that we stand ready to supply additional proof details or clarifications if the editor or referee requests them.
Circularity Check
No significant circularity identified
full rationale
The derivation relies on explicit modeling assumptions (rooted tree encoding action similarities, tree-compatible loss sequences) that are stated upfront and used to define K_eff as a similarity-aware effective action count; regret bounds are then derived by substituting this quantity into standard analyses for richer feedback models. The impossibility result for one-point feedback is shown separately as a complementary negative result, and the Lipschitz-bandit application for d ≤ 2 follows directly from instantiating the general two-point-feedback algorithm without reducing any bound to a fitted parameter or self-referential definition. No load-bearing step collapses to a self-citation chain, ansatz smuggled via prior work, or renaming of a known empirical pattern; the central claims remain independent of the target results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Action set encoded by a rooted tree whose leaves are actions and levels quantify closeness.
- domain assumption Loss sequence is tree-compatible: losses of similar actions are constrained to be close.
Reference graph
Works this paper leans on
-
[1]
Improved algorithms for linear stochastic bandits
Yasin Abbasi-Yadkori, D \'a vid P \'a l, and Csaba Szepesv \'a ri. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011
2011
-
[2]
Optimal algorithms for online convex optimization with multi-point bandit feedback
Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt, pages 28--40, 2010
2010
-
[3]
Thompson sampling for contextual bandits with linear payoffs
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127--135. PMLR, 2013
2013
-
[4]
Mixed-effect thompson sampling
Imad Aouali, Branislav Kveton, and Sumeet Katariya. Mixed-effect thompson sampling. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2087--2115. PMLR, 2023
2087
-
[5]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002
2002
-
[6]
Convex optimization theory, volume 1
Dimitri Bertsekas. Convex optimization theory, volume 1. Athena Scientific, 2009
2009
-
[7]
Sébastien Bubeck, Nicoló Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 41.1--41.14. PMLR, 2012
2012
-
[8]
Algorithmic chaining and the role of partial feedback in online nonparametric learning
Nicol \`o Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and S \'e bastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, pages 465--481. PMLR, 2017
2017
-
[9]
Combinatorial bandits revisited
Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015
2015
-
[10]
Minimal exploration in structured stochastic bandits
Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[11]
A blackbox approach to best of both worlds in bandits and beyond
Chris Dann, Chen - Yu Wei, and Julian Zimmert. A blackbox approach to best of both worlds in bandits and beyond. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5503--5570. PMLR, 12--15 Jul 2023
2023
-
[12]
Bandits with side observations: Bounded vs
R \'e my Degenne, Evrard Garcelon, and Vianney Perchet. Bandits with side observations: Bounded vs. logarithmic regret. In Conference on Uncertainty in Artificial Intelligence, 2018
2018
-
[13]
Optimal rates for zero-order convex optimization: The power of two function evaluations
John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015
2015
-
[14]
Online convex optimization in the bandit setting: gradient descent without a gradient
Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004
Pith/arXiv arXiv 2004
-
[15]
Refined lower bounds for adversarial bandits
S \'e bastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. Advances in Neural Information Processing Systems, 29, 2016
2016
-
[16]
Zeroth-order non-convex learning via hierarchical dual averaging
Am \'e lie H \'e liou, Matthieu Martin, Panayotis Mertikopoulos, and Thibaud Rahier. Zeroth-order non-convex learning via hierarchical dual averaging. In International Conference on Machine Learning, pages 4192--4202. PMLR, 2021
2021
-
[17]
Deep hierarchy in bandits
Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, and Mohammad Ghavamzadeh. Deep hierarchy in bandits. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8833--8851. PMLR, 2022 a
2022
-
[18]
Hierarchical bayesian bandits
Joey Hong, Branislav Kveton, Manzil Zaheer, and Mohammad Ghavamzadeh. Hierarchical bayesian bandits. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 7724--7741. PMLR, 2022 b
2022
-
[19]
Tight first-and second-order regret bounds for adversarial linear bandits
Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33: 0 2028--2038, 2020
2028
-
[20]
Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs
Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems, volume 35, pages 28631--28643. Curran Associates, Inc., 2022
2022
-
[21]
David S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9 0 (3): 0 256--278, 1974. ISSN 0022-0000
1974
-
[22]
Robust lipschitz bandits to adversarial corruptions
Yue Kang, Cho-Jui Hsieh, and Thomas Chun Man Lee. Robust lipschitz bandits to adversarial corruptions. Advances in Neural Information Processing Systems, 36: 0 10897--10908, 2023
2023
-
[23]
Nearly tight bounds for the continuum-armed bandit problem
Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, volume 17. MIT Press, 2004
2004
-
[24]
Multi-armed bandits in metric spaces
Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681--690, 2008
2008
-
[25]
Bandits and experts in metric spaces
Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66 0 (4): 0 1--77, 2019
2019
-
[26]
Bandit algorithms
Tor Lattimore and Csaba Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020
2020
-
[27]
Adaptive regret for bandits made possible: Two queries suffice
Zhou Lu, Qiuyi Zhang, Xinyi Chen, Fred Zhang, David Woodruff, and Elad Hazan. Adaptive regret for bandits made possible: Two queries suffice. arXiv preprint arXiv:2401.09278, 2024
arXiv 2024
-
[28]
Nested bandits
Matthieu Martin, Panayotis Mertikopoulos, Thibaud Rahier, and Houssam Zenati. Nested bandits. In International Conference on Machine Learning, pages 15093--15121. PMLR, 2022
2022
-
[29]
Prior-dependent allocations for bayesian fixed-budget best-arm identification in structured bandits
Nicolas Nguyen, Imad Aouali, Andr \'a s Gy \"o rgy, and Claire Vernade. Prior-dependent allocations for bayesian fixed-budget best-arm identification in structured bandits. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 of Proceedings of Machine Learning Research, pages 379--387. PMLR, 2025
2025
-
[30]
A modern introduction to online learning
Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2025
Pith/arXiv arXiv 1912
-
[31]
Adaptive discretization for adversarial lipschitz bandits
Chara Podimata and Alex Slivkins. Adaptive discretization for adversarial lipschitz bandits. In Conference on Learning Theory, pages 3788--3805. PMLR, 2021
2021
-
[32]
Tyrrell Rockafellar
R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, 1970
1970
-
[33]
Exploiting easy data in online optimization
Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014
2014
-
[34]
An optimal algorithm for bandit and zero-order convex optimization with two-point feedback
Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017
2017
-
[35]
A tight analysis of the greedy algorithm for set cover
Petr Slav\' k. A tight analysis of the greedy algorithm for set cover. Journal of Algorithms, 25 0 (2): 0 237--254, 1997. ISSN 0196-6774
1997
-
[36]
Multi-armed bandits on implicit metric spaces
Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. Advances in Neural Information Processing Systems, 24, 2011
2011
-
[37]
Adaptation to easy data in prediction with limited advice
Tobias Sommer Thune and Yevgeny Seldin. Adaptation to easy data in prediction with limited advice. Advances in Neural Information Processing Systems, 31, 2018
2018
-
[38]
Efficient learning in large-scale combinatorial semi-bandits
Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pages 1113--1122. PMLR, 2015
2015
-
[39]
Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits
Julian Zimmert and Yevgeny Seldin. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22 0 (28): 0 1--49, 2021
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.