arxiv: 2503.00565 · v3 · pith:LVZFTV6Dnew · submitted 2025-03-01 · stat.ML · cs.LG · math.ST · stat.ME · stat.TH

Batched Single-Index Global Multi-Armed Bandits with Covariates

Pith reviewed 2026-05-23 01:24 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.ST · stat.ME · stat.TH
keywords: batched bandits · single-index model · covariates · regret bounds · successive elimination · dynamic binning · semi-parametric bandits

The pith

Single-index regression lets batched bandits with covariates achieve optimal one-dimensional regret rates when a pilot direction is given accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a semi-parametric framework for batched multi-armed bandits with covariates that models arm rewards through a shared single-index regression. It presents the BIDS algorithm, which performs successive arm elimination in batches while dynamically binning observations along the index direction. When the index direction is supplied by a sufficiently accurate pilot and the number of arms is fixed, the resulting regret matches the minimax-optimal nonparametric rate for one-dimensional covariates. A sympathetic reader would care because this removes the usual exponential slowdown from high-dimensional covariates in batch-feedback settings such as personalized medicine or recommendation systems.

Core claim

The BIDS algorithm, which pairs batched successive arm elimination with dynamic binning guided by the single-index direction, attains minimax-optimal rates (equivalent to d=1) for nonparametric batched bandits whenever a pilot direction of sufficient accuracy is available and the number of arms K is fixed.
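The elimination step named in the claim can be illustrated with a generic successive-elimination rule applied inside a single bin. This is a minimal sketch and not the paper's procedure: the function name `eliminate_arms`, the confidence parameter `delta`, and the Hoeffding-style radius (which assumes rewards in [0, 1]) are all assumptions made for illustration.

```python
import numpy as np

def eliminate_arms(means, counts, active, delta=0.05):
    """Return the mask of arms that survive one elimination round in one bin.

    means, counts: per-arm empirical mean reward and sample count.
    active: boolean mask of arms still in play.
    delta: confidence parameter (hypothetical choice, not taken from the paper).
    """
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    active = np.asarray(active, dtype=bool)
    if not active.any():
        return active
    # Hoeffding-style radius for rewards in [0, 1]; unsampled arms get an infinite radius.
    radius = np.full(means.shape, np.inf)
    sampled = counts > 0
    radius[sampled] = np.sqrt(np.log(2.0 / delta) / (2.0 * counts[sampled]))
    lower = means - radius
    upper = means + radius
    # An arm is dropped once its upper bound falls below the best lower bound.
    best_lower = np.max(lower[active])
    return active & (upper >= best_lower)
```

In a batched setting this rule would be applied only at batch boundaries, using the rewards collected in each bin since the previous update.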

What carries the argument

The single-index regression model relating rewards to covariates via an unknown link function of a linear index, together with the dynamic binning mechanism inside the BIDS batched elimination procedure.
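To make the binning idea concrete, the sketch below projects covariates onto a direction and assigns each observation to an equal-width bin of the resulting one-dimensional index. It is an illustration under stated assumptions: `index_bins` is a hypothetical helper, and the static equal-width grid stands in for the paper's dynamic binning, which refines bins adaptively.

```python
import numpy as np

def index_bins(X, beta, n_bins):
    """Assign each covariate row to an equal-width bin of its projection X @ beta."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    beta = beta / np.linalg.norm(beta)   # work with a unit direction
    index = X @ beta                     # one-dimensional index per observation
    edges = np.linspace(index.min(), index.max(), n_bins + 1)
    # Use interior edges only, so labels fall in 0..n_bins-1
    # and the largest index lands in the last bin.
    labels = np.digitize(index, edges[1:-1])
    return index, labels
```

Because the bins live on a line rather than in the full covariate space, the number of bins at a given resolution does not grow with the covariate dimension, which is the intuition behind the reduction to d = 1.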

If this is right

  • Regret bounds are derived both when a pilot direction is supplied and when the direction must be estimated from data.
  • The method circumvents the curse of dimensionality for fixed K by reducing the effective dimension to one.
  • Extensive simulations and real-data experiments show lower regret than the nonparametric batched bandit baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework suggests that single-index models can serve as a practical compromise between linear and fully nonparametric contextual bandits under batch feedback.
  • If the direction can be estimated adaptively at negligible extra cost, the same one-dimensional rates may hold without an external pilot.
  • The approach may extend to other batch sequential decision problems where a low-dimensional structure is known to be present but not fully linear.

Load-bearing premise

The true relationship between covariates and rewards is correctly described by a single-index model shared across arms.
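
In symbols, the premise can be written roughly as follows. The notation is an assumption made for illustration and may differ from the paper's: $g^{(k)}$ stands for a hypothetical link function for arm $k$, $\beta_0$ for the index direction shared by all arms, and the unit-norm constraint is the usual identifiability convention for single-index models.

```latex
\mathbb{E}\bigl[ Y_t^{(k)} \mid X_t = x \bigr] = g^{(k)}\bigl( x^\top \beta_0 \bigr),
\qquad k = 1, \dots, K, \qquad \lVert \beta_0 \rVert_2 = 1 .
```

If the mean reward of any arm depends on the covariates through more than one linear combination, this representation fails and the one-dimensional rates no longer apply.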

What would settle it

A numerical experiment in which an accurate pilot direction is supplied yet the observed regret rate stays as slow as the full nonparametric d-dimensional rate would falsify the optimality claim.
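
One way to run such a check is to fit the empirical growth exponent of cumulative regret on a log-log scale and compare it with the exponents predicted for one and for d dimensions. The sketch below shows only the fitting step; `regret_exponent` is a hypothetical helper, and the inputs are assumed to be cumulative regrets measured at several horizons T.

```python
import numpy as np

def regret_exponent(horizons, regrets):
    """Least-squares slope of log(cumulative regret) against log(T)."""
    log_t = np.log(np.asarray(horizons, dtype=float))
    log_r = np.log(np.asarray(regrets, dtype=float))
    # A degree-1 polynomial fit returns (slope, intercept).
    slope, _intercept = np.polyfit(log_t, log_r, 1)
    return float(slope)
```

A fitted slope that stays near the d-dimensional exponent despite an accurate pilot direction would count against the claim. Logarithmic factors and small horizons can bias the slope, so the comparison is only indicative.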

Figures

Figures reproduced from arXiv: 2503.00565 by Hyebin Song, Sakshi Arya.

Figure 1: Illustration of ABSE in a 2-dimensional setting. The algorithm partitions the context space ($[0, 1]^2$) at Levels 1, 2, and 3, running local arm elimination in each bin. Bins with confidently identified optimal arms (light-blue bins for Level 1 and blue bins for Level 2) are not refined further, while bins without optimal arms are split into $2^2 = 4$ equal-sized sub-bins.
Figure 2: A linear model example with $Y_t = \beta_1 X_{t,1} + \beta_2 X_{t,2} + \epsilon_t$, where $X_t = (X_{t,1}, X_{t,2}) \in \mathbb{R}^2$ and $\epsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2 = 1)$ for $t = 1, \dots, 25$. (a) 3-D representation of the simulated data. (b) Projection of covariates $X \in \mathbb{R}^2$ (circles with holes) onto the single-index direction (red dotted line), with projected points shown as black circles connected by gray lines. (c) Rotated view of (b) to align the SIR directi…
Figure 3: Mean reward functions for the two simulation settings. We let $Y_t^{(k)} = f^{(k)}(X_t) + \epsilon_t$, where $\epsilon_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ for $t = 1, \dots, T$, with $\sigma^2 > 0$ representing the noise variance.
Figure 4: Average regret ($(R_t)_{t=1}^T$) with pilot directions $\beta$ of varying accuracy, measured by $\sin\theta = \sin\angle(\beta, \beta_0)$, for the two simulation settings. Different colors of the solid lines represent different levels of perturbation, where $\sin\angle(\beta, \beta_0) = 0$ corresponds to no perturbation and $\sin\angle(\beta, \beta_0) = 1$ corresponds to orthogonal vectors. As the degree of perturbation increases, performance deteriorates but still beats …
Figure 5: Average regret ($(R_t)_{t=1}^T$) with varying model noise $\sigma$ for the two simulation settings. As the noise level increases, the performance of the BIDS algorithm (solid) remains better than the nonparametric analogue (dashed) but deviates further from the BIDS oracle (dashed-dotted).
Figure 6: Comparison of expected regret of the proposed semiparametric BIDS algorithm and the nonparametric batched bandit algorithm (BaSEDB) on a) rice classification, b) occupancy detection, and c) EEG datasets, with $\beta_0$ estimated in the initial phase with $t_{\text{init}} \approx T^{2/3}$ for their respective data lengths $T$. Vertical solid and dashed lines denote the batch markings for the BIDS and BaSEDB algorithms, respectively. Ob…
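
The pilot-accuracy measure in Figure 4, the sine of the angle between the pilot and true directions, can be reproduced in a small experiment by constructing a pilot at a prescribed angle. The sketch below is an assumption about how such a perturbation might be generated, not the authors' code; `perturb_direction` and `sin_angle` are hypothetical helpers, and the construction needs covariate dimension at least 2.

```python
import numpy as np

def perturb_direction(beta0, sin_theta, rng):
    """Return a unit vector whose angle with beta0 has the given sine (angle in [0, pi/2])."""
    beta0 = np.asarray(beta0, dtype=float)
    beta0 = beta0 / np.linalg.norm(beta0)
    # Draw a random vector and remove its component along beta0 (Gram-Schmidt step).
    u = rng.standard_normal(beta0.shape)
    u -= (u @ beta0) * beta0
    u /= np.linalg.norm(u)
    cos_theta = np.sqrt(1.0 - sin_theta ** 2)
    return cos_theta * beta0 + sin_theta * u

def sin_angle(a, b):
    """Sine of the angle between two nonzero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.sqrt(max(0.0, 1.0 - cos ** 2)))
```

Setting `sin_theta` to 0 returns the true direction and setting it to 1 returns an orthogonal vector, matching the two extremes described in the caption.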

Abstract

The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications, such as personalized medicine and recommendation systems, contextual information is available at the time of decision-making, rewards from different arms are related rather than independent, and feedback is provided in batches. We propose a novel semi-parametric framework for batched bandits with covariates that incorporates a shared parameter across arms. We leverage the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy and the number of arms $K$ is fixed, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by Jiang and Ma [32].

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the BIDS algorithm for batched global multi-armed bandits with covariates under a single-index regression model. It considers two settings (pilot direction supplied with sufficient accuracy; direction estimated from data), derives theoretical regret bounds for both, and claims that with fixed K and an accurate pilot the method attains minimax-optimal nonparametric rates for the effective dimension d=1, thereby circumventing the curse of dimensionality. Experiments on simulated and real data compare BIDS to the nonparametric batched bandit baseline of Jiang et al. (2025).

Significance. If the stated regret bounds hold under the single-index model and the pilot-accuracy condition, the work supplies a concrete semi-parametric route to dimension-free rates in batched contextual bandits. This is potentially useful for applications such as personalized medicine where high-dimensional covariates are present but a low-dimensional index structure may be plausible. The explicit separation of the pilot and estimation cases, together with the dynamic binning mechanism aligned to the index, is a clear technical contribution when the modeling assumptions are met.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained without an additional logarithmic or polynomial factor.
  2. [Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.
minor comments (2)
  1. [Notation] Notation for the single-index direction and the binning grid is introduced without a consolidated table of symbols; readers must hunt through the text to recover definitions.
  2. [Experiments] The experiments section reports performance on real-world data but does not state the dimension d of the covariates or the effective sample size per batch, making it hard to judge whether the observed gains are consistent with the d=1 regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.

  1. Referee: [Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained without an additional logarithmic or polynomial factor.

    Authors: The complete proofs appear in Appendices B and C. The link function is assumed twice continuously differentiable with bounded second derivative; batch sizes follow the schedule $b_m = 2^m$ with $m$ up to $\log T$; and the pilot direction must satisfy $\lVert \beta_{\text{pilot}} - \beta_{\text{true}} \rVert = O(T^{-1/4})$ to eliminate extra factors and recover the d=1 minimax rate. We will insert a concise statement of these conditions together with a reference to the appendix at the beginning of the regret-bounds section in the revision. revision: yes

  2. Referee: [Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.

    Authors: The abstract and introduction already state that the d=1 rate holds only under the stated pilot-accuracy and fixed-K conditions. When the pilot error exceeds the threshold the effective dimension rises and the rate reverts to the nonparametric d-dimensional bound; when K grows, an extra poly(K) factor appears. A full quantitative extension to these regimes lies outside the present scope and would essentially reproduce the Jiang et al. (2025) analysis. We will add one paragraph in the discussion section clarifying these scope limitations. revision: partial
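
The doubling schedule mentioned in the first response can be sketched as follows. The rebuttal does not say whether $b_m$ denotes a batch size or a batch end point; this sketch assumes it is the size of batch $m$ and truncates the final batch at the horizon. The function name `batch_endpoints` is hypothetical.

```python
def batch_endpoints(horizon):
    """Cumulative end times of batches with sizes 2, 4, 8, ..., truncated at the horizon."""
    endpoints = []
    total, m = 0, 1
    while total < horizon:
        # Batch m has size 2**m; the final batch is cut off at the horizon.
        total = min(total + 2 ** m, horizon)
        endpoints.append(total)
        m += 1
    return endpoints
```

For example, `batch_endpoints(20)` gives `[2, 6, 14, 20]`. Under this reading the number of batches grows logarithmically in the horizon, consistent with "m up to log T".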

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper conditions its minimax rate claim explicitly on the single-index model holding and on a pilot direction supplied with sufficient accuracy (K fixed). Under those conditions the reduction to effective dimension 1 follows from standard 1-D nonparametric rates once binning aligns with the known index; no equation reduces a claimed prediction to a fitted quantity defined by the same data, and no load-bearing step relies on self-citation for uniqueness, ansatz, or model justification. The estimation-of-direction case is treated separately with its own bounds. The derivation is therefore self-contained against external nonparametric benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the single-index model being correctly specified and on the pilot direction being sufficiently accurate; these are domain assumptions rather than free parameters or new entities.

axioms (2)
  • domain assumption: The single-index regression model holds for the reward functions across arms.
    Abstract states the framework 'leverages the single-index regression (SIR) model to capture relationships between arm rewards'.
  • domain assumption: The pilot direction is available with sufficient accuracy when claimed.
    Regret optimality is conditioned on 'a pilot direction is available with sufficient accuracy'.

pith-pipeline@v0.9.0 · 5790 in / 1334 out tokens · 23750 ms · 2026-05-23T01:24:24.047702+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

  1. Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved algorithms for linear stochastic bandits, in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., vol. 24, Curran Associates, Inc., 2011.

  2. S. Agrawal and N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 127–135.

  3. S. Arya and B. K. Sriperumbudur, Kernel ϵ-greedy for contextual bandits, arXiv preprint arXiv:2306.17329, (2023).

  4. P. M. Asquith and H. Ihshaish, Classification of eye-state using EEG recordings: speed-up gains using signal epochs and mutual information measure, in Proceedings of the 23rd International Database Applications & Engineering Symposium, 2019, pp. 1–6.

  5. O. Atan, C. Tekin, and M. van der Schaar, Global multi-armed bandits with Hölder continuity, in Artificial Intelligence and Statistics, PMLR, 2015, pp. 28–36.

  6. O. Atan, C. Tekin, and M. van der Schaar, Global bandits, IEEE Transactions on Neural Networks and Learning Systems, 29 (2018), pp. 5798–5811.

  7. D. Babichev and F. Bach, Slice inverse regression with score functions, Electronic Journal of Statistics, 12 (2018), pp. 1507–1543.

  8. H. Bastani and M. Bayati, Online decision making with high-dimensional covariates, Operations Research, 68 (2020), pp. 276–294.

  9. A. Bietti, A. Agarwal, and J. Langford, A contextual bandit bake-off, Journal of Machine Learning Research, 22 (2021), pp. 1–49.

  10. T. T. Cai and H. Pu, Stochastic continuum-armed bandits with additive models: Minimax regrets and adaptive algorithm, The Annals of Statistics, 50 (2022), pp. 2179–2204.

  11. Z. Cai, R. Li, and L. Zhu, Online sufficient dimension reduction through sliced inverse regression, Journal of Machine Learning Research, 21 (2020), pp. 1–25.

  12. L. Candanedo, Occupancy Detection. UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C5X01N

  13. S. R. Chowdhury and A. Gopalan, On kernelized multi-armed bandits, in International Conference on Machine Learning, PMLR, 2017, pp. 844–853.

  14. W. Chu, L. Li, L. Reyzin, and R. Schapire, Contextual bandits with linear payoff functions, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudik, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 208–214.

  15. I. Cinar and M. Koklu, Rice (Cammeo and Osmancik). UCI Machine Learning Repository, 2019. DOI: https://doi.org/10.24432/C5MW4Z

  16. G. Cinarer, N. Erbaş, and A. Öcal, Rice classification and quality detection success with artificial intelligence technologies, Brazilian Archives of Biology and Technology, (2024).

  17. R. Dai, H. Song, R. F. Barber, and G. Raskutti, Convergence guarantee for the sparse monotone single index model, Electronic Journal of Statistics, 16 (2022), pp. 4449–4496.

  18. H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni, Regret bounds for batched bandits, Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), pp. 7340–7348.

  19. Y. Feng, Z. Huang, and T. Wang, Lipschitz bandits with batched feedback, in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds., 2022.

  20. S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári, Parametric Bandits: The generalized linear case, in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds., vol. 23, Curran Associates, Inc., 2010.

  21. A. Ghosh, S. R. Chowdhury, and A. Gopalan, Misspecified linear bandits, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, 2017.

  22. A. Goldenshluger and A. Zeevi, A linear response bandit problem, Stochastic Systems, 3 (2013), pp. 230–261.

  23. K. Greenewald, A. Tewari, S. Murphy, and P. Klasnja, Action centered contextual bandits, in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017.

  24. Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou, Batched neural bandits, ACM / IMS J. Data Sci., 1 (2024).

  25. S. Gupta, S. Chaudhari, G. Joshi, and O. Yağan, Multi-armed bandits with correlated arms, IEEE Transactions on Information Theory, 67 (2021), pp. 6711–6732.

  26. Y. Gur, A. Momeni, and S. Wager, Smoothness-adaptive contextual bandits, Operations Research, 70 (2022), pp. 3198–3216.

  27. Y. Han, Z. Zhou, Z. Zhou, J. Blanchet, P. W. Glynn, and Y. Ye, Sequential batch learning in finite-action linear contextual bandits, arXiv preprint arXiv:2004.06321, (2020).

  28. W. Hardle, P. Hall, and H. Ichimura, Optimal smoothing in single-index models, The Annals of Statistics, 21 (1993), pp. 157–178.

  29. Y. Hu, N. Kallus, and X. Mao, Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes, in Conference on Learning Theory, PMLR, 2020, pp. 2007–2010.

  30. H. Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models, Journal of econometrics, 58 (1993), pp. 71–120.

  31. H. Jiang, Non-asymptotic uniform rates of consistency for k-nn regression, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3999–4006.

  32. R. Jiang and C. Ma, Batched nonparametric contextual bandits, arXiv preprint arXiv:2402.17732, (2024).

  33. T. Jin, J. Tang, P. Xu, K. Huang, X. Xiao, and Q. Gu, Almost optimal anytime algorithm for batched multi-armed bandits, in International Conference on Machine Learning, PMLR, 2021, pp. 5065–5073.

  34. C. Kalkanli and A. Ozgur, Batched Thompson sampling, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 29984–29994.

  35. K. Kandasamy, J. Schneider, and B. Poczos, High dimensional Bayesian optimisation and bandits via additive models, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 07–09 Jul 2015, PMLR, pp. 295–304, https://proceedings.mlr.press/v37/kanda...

  36. G. H. Khan and M. A. Rahman, Room occupancy detection from temperature, light, humidity, and carbon dioxide measurements using deep learning, in 2021 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 2021, pp. 1–4.

  37. G.-S. Kim and M. C. Paik, Contextual multi-armed bandit algorithm for semiparametric reward model, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 3389–3397.

  38. A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, Semiparametric contextual bandits, in International Conference on Machine Learning, PMLR, 2018, pp. 2776–2785.

  39. A. K. Kuchibhotla and R. K. Patra, Efficient estimation in single index models through smoothing splines, Bernoulli, 26 (2020), pp. 1587–1618.

  40. W. Kuszmaul and Q. Qi, The multiplicative version of Azuma's inequality, with an application to contention analysis, arXiv preprint arXiv:2102.05077, (2021).

  41. T. L. Lai, Adaptive treatment allocation and the multi-armed bandit problem, The Annals of Statistics, (1987), pp. 1091–1114.

  42. T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in applied mathematics, 6 (1985), pp. 4–22.

  43. K. Li, Y. Yang, and N. N. Narisetty, Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit, Electronic Journal of Statistics, 15 (2021), pp. 5652–5695.

  44. K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 86 (1991), pp. 316–327.

  45. K.-C. Li and N. Duan, Regression analysis under link violation, The Annals of Statistics, (1989), pp. 1009–1052.

  46. W. Li, A. Barik, and J. Honorio, A simple unified framework for high dimensional bandit problems, in International Conference on Machine Learning, PMLR, 2022, pp. 12619–12655.

  47. W. Li, N. Chen, and L. J. Hong, Dimension reduction in contextual online learning via nonparametric variable selection, Journal of Machine Learning Research, 24 (2023), pp. 1–84.

  48. W. K. Newey and T. M. Stoker, Efficiency of weighted average derivative estimators and index models, Econometrica: Journal of the Econometric Society, (1993), pp. 1199–1223.

  49. V. Perchet and P. Rigollet, The multi-armed bandit problem with covariates, The Annals of Statistics, (2013).

  50. V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg, Batched bandit problems, The Annals of Statistics, 44 (2016), pp. 660–681.

  51. W. Qian, C.-K. Ing, and J. Liu, Adaptive algorithm for multi-armed bandit problem with high-dimensional covariates, Journal of the American Statistical Association, 119 (2024), pp. 970–982.

  52. W. Qian and Y. Yang, Kernel estimation and model combination in a bandit problem with covariates, Journal of Machine Learning Research, 17 (2016).

  53. Z. Ren, Z. Zhou, and J. R. Kalagnanam, Batched learning in generalized linear contextual bandits with general decision sets, IEEE Control Systems Letters, 6 (2022), pp. 37–42.

  54. P. Rigollet and A. Zeevi, Nonparametric bandits with covariates, Conference on Learning Theory (COLT), (2010), p. 54.

  55. O. Roesler, EEG Eye State. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C57G7J

  56. O. Rösler and D. Suendermann, A first step towards eye state prediction using EEG, Proc. of the AIHLS, 1 (2013), pp. 1–4.

  57. C. Shen, R. Zhou, C. Tekin, and M. van der Schaar, Generalized global bandit and its application in cellular coverage optimization, IEEE Journal of Selected Topics in Signal Processing, 12 (2018), pp. 218–232.

  58. C. Shi, C. Shen, and J. Yang, Federated multi-armed bandits with personalization, in International conference on artificial intelligence and statistics, PMLR, 2021, pp. 2917–2925.

  59. A. Tsybakov, Introduction to Nonparametric Estimation, Springer Series in Statistics, Springer New York, 2008.

  60. M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, Finite-time analysis of kernelised contextual bandits, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013, pp. 654–663.

  61. B. Van Parys and N. Golrezaei, Optimal learning for structured bandits, Management Science, 70 (2024), pp. 3951–3998.

  62. N. Wanigasekara and C. Yu, Nonparametric contextual bandits in metric spaces with unknown metric, in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds., vol. 32, Curran Associates, Inc., 2019.

  63. W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu, Multi-armed bandit-based client scheduling for federated learning, IEEE Transactions on Wireless Communications, 19 (2020), pp. 7108–7123.

  64. Y. Yang and D. Zhu, Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates, The Annals of Statistics, 30 (2002), pp. 100–121.

  65. Y. Yu, T. Wang, and R. J. Samworth, A useful variant of the Davis–Kahan theorem for statisticians, Biometrika, 102 (2015), pp. 315–323.

  66. D. Zhou, L. Li, and Q. Gu, Neural contextual bandits with UCB-based exploration, in International Conference on Machine Learning, PMLR, 2020, pp. 11492–11502.

  67. Y. Zhu, D. Zhou, R. Jiang, Q. Gu, R. Willett, and R. Nowak, Pure exploration in kernel and neural bandits, Advances in neural information processing systems, 34 (2021), pp. 11618–11630.