Batched Single-Index Global Multi-Armed Bandits with Covariates

Hyebin Song; Sakshi Arya

arxiv: 2503.00565 · v3 · pith:LVZFTV6Dnew · submitted 2025-03-01 · 📊 stat.ML · cs.LG· math.ST· stat.ME· stat.TH

Batched Single-Index Global Multi-Armed Bandits with Covariates

Sakshi Arya , Hyebin Song This is my paper

Pith reviewed 2026-05-23 01:24 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.STstat.MEstat.TH

keywords batched banditssingle-index modelcovariatesregret boundssuccessive eliminationdynamic binningsemi-parametric bandits

0 comments

The pith

Single-index regression lets batched bandits with covariates achieve optimal one-dimensional regret rates when a pilot direction is given accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a semi-parametric framework for batched multi-armed bandits with covariates that models arm rewards through a shared single-index regression. It presents the BIDS algorithm, which performs successive arm elimination in batches while dynamically binning observations along the index direction. When the index direction is supplied by a sufficiently accurate pilot and the number of arms is fixed, the resulting regret matches the minimax-optimal nonparametric rate for one-dimensional covariates. A sympathetic reader would care because this removes the usual exponential slowdown from high-dimensional covariates in batch-feedback settings such as personalized medicine or recommendation systems.

Core claim

The BIDS algorithm, which pairs batched successive arm elimination with dynamic binning guided by the single-index direction, attains minimax-optimal rates (equivalent to d=1) for nonparametric batched bandits whenever a pilot direction of sufficient accuracy is available and the number of arms K is fixed.

What carries the argument

The single-index regression model relating rewards to covariates via an unknown link function of a linear index, together with the dynamic binning mechanism inside the BIDS batched elimination procedure.

If this is right

Regret bounds are derived both when a pilot direction is supplied and when the direction must be estimated from data.
The method circumvents the curse of dimensionality for fixed K by reducing the effective dimension to one.
Extensive simulations and real-data experiments show lower regret than the nonparametric batched bandit baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework suggests that single-index models can serve as a practical compromise between linear and fully nonparametric contextual bandits under batch feedback.
If the direction can be estimated adaptively at negligible extra cost, the same one-dimensional rates may hold without an external pilot.
The approach may extend to other batch sequential decision problems where a low-dimensional structure is known to be present but not fully linear.

Load-bearing premise

The true relationship between covariates and rewards is correctly described by a single-index model shared across arms.

What would settle it

A numerical experiment in which an accurate pilot direction is supplied yet the observed regret rate stays as slow as the full nonparametric d-dimensional rate would falsify the optimality claim.

Figures

Figures reproduced from arXiv: 2503.00565 by Hyebin Song, Sakshi Arya.

**Figure 1.** Figure 1: Illustration of ABSE in 2-dimensional setting. The algorithm partitions the context space ( [0, 1]2 ) at Levels 1, 2, and 3, running local arm elimination in each bin. Bins with confidently identified optimal arms (light-blue colored bins for Level 1 and blue-colored bins for Level 2) are not refined further, while bins without optimal arms are split into 2 2 = 4 equal-sized sub-bins. with reasonable accur… view at source ↗

**Figure 2.** Figure 2: A linear model example with Yt = β1Xt,1 + β2Xt,2 + ϵt, where Xt = (Xt,1, Xt,2) ∈ R 2 and ϵt i.i.d ∼ N(0, σ2 = 1) for t = 1, . . . , 25. (a) 3-D representation of the simulated data. (b) Projection of covariates X ∈ R 2 (circles with holes) onto the single-index direction (red dotted line), with projected points shown as black circles connected by gray lines. (c) Rotated view of (b) to align the SIR directi… view at source ↗

**Figure 3.** Figure 3: Mean reward functions for the two simulation settings We let Y (k) t = f (k) (Xt)+ϵt , where ϵt i.i.d. ∼ N(0, σ2 ) for t = 1, . . . , T, with σ 2 > 0, representing the noise variance. In the first case, where we test the performance of the BIDS algorithm 17 [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Average regret ((Rt) T t=1) with pilot directions β with varying accuracy, measured by sin θ = sin ∠β, β0 for the two simulation settings. Different colors of the solid lines represent different levels of perturbation, where sin ∠β, β0 = 0 corresponds to no perturbation, and sin ∠β, β0 = 1 corresponds to orthogonal vectors. As the degree of perturbation increases, performance deteriorates but still beats … view at source ↗

**Figure 5.** Figure 5: Average regret ((Rt) T t=1) with varying model noise σ for the two simulation settings. As the noise level increases, while the performance of the BIDS algorithm (solid) remains better than the nonparametric analogue (dashed), but deviates further from the BIDS oracle (dashed-dotted). ting 2, they have more of an overlap in various regions. Therefore, even with higher model error in setting 1, it is easier… view at source ↗

**Figure 6.** Figure 6: Comparison of expected regret of the proposed semiparametric BIDS algorithm and the nonparametric batched bandit algorithm (BaSEDB) on a) rice classification, b) occupancy detection, and c) EEG datasets, with β0 estimated in the initial phase with tinit ≈ T 2/3 for their respective data lengths T. Vertical solid and dashed lines denote the batch markings for the BIDS and BaSEDB algorithm, respectively. Ob… view at source ↗

read the original abstract

The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications, such as personalized medicine and recommendation systems, contextual information is available at the time of decision-making, rewards from different arms are related rather than independent, and feedback is provided in batches. We propose a novel semi-parametric framework for batched bandits with covariates that incorporates a shared parameter across arms. We leverage the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy and the number of arms $K$ is fixed, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by \cite{jiang2025batched}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BIDS uses single-index structure plus pilot direction to reach 1D nonparametric rates in batched contextual bandits when K is fixed.

read the letter

The main takeaway is that this paper gives a concrete way to sidestep the dimension curse in batched bandits with covariates by imposing a single-index model and assuming a reasonably accurate pilot direction is available. When those hold and the number of arms stays fixed, the claimed regret matches the minimax rate for one-dimensional nonparametric problems. They also treat the case where the direction must be estimated from data and supply regret bounds for both settings. The algorithm BIDS pairs dynamic binning along the index with batched successive elimination, and the experiments show it beating the fully nonparametric baseline from Jiang et al. on simulated and real data. That combination of single-index regression with batched elimination looks new relative to the cited prior work. The practical motivation is clear for high-dimensional applications where full nonparametric methods scale poorly. The central claim is explicitly conditional on the model being correct and the pilot being good enough, which keeps the argument from overreaching. The main soft spot is that the abstract and stress-test note leave the exact assumptions on the link function, batch-size scaling, and how estimation error interacts with the regret rate unverified; without the full derivations it is difficult to judge whether those conditions are mild or quite strong. The experiments are useful for illustration but do not substitute for checking the proof details. This work is aimed at researchers who already think about structured or semi-parametric bandits and want regret bounds that improve on pure nonparametric rates under plausible assumptions. A reader focused on theoretical analysis with some empirical checks would find it worth their time. It is coherent on its own terms and shows honest engagement with the literature, so it deserves a serious referee even if revisions are needed on the proof presentation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the BIDS algorithm for batched global multi-armed bandits with covariates under a single-index regression model. It considers two settings (pilot direction supplied with sufficient accuracy; direction estimated from data), derives theoretical regret bounds for both, and claims that with fixed K and an accurate pilot the method attains minimax-optimal nonparametric rates for the effective dimension d=1, thereby circumventing the curse of dimensionality. Experiments on simulated and real data compare BIDS to the nonparametric batched bandit baseline of Jiang et al. (2025).

Significance. If the stated regret bounds hold under the single-index model and the pilot-accuracy condition, the work supplies a concrete semi-parametric route to dimension-free rates in batched contextual bandits. This is potentially useful for applications such as personalized medicine where high-dimensional covariates are present but a low-dimensional index structure may be plausible. The explicit separation of the pilot and estimation cases, together with the dynamic binning mechanism aligned to the index, is a clear technical contribution when the modeling assumptions are met.

major comments (2)

[Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained rather than an additional logarithmic or polynomial factor appearing.
[Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.

minor comments (2)

[Notation] Notation for the single-index direction and the binning grid is introduced without a consolidated table of symbols; readers must hunt through the text to recover definitions.
[Experiments] The experiments section reports performance on real-world data but does not state the dimension d of the covariates or the effective sample size per batch, making it hard to judge whether the observed gains are consistent with the d=1 regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.

read point-by-point responses

Referee: [Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained rather than an additional logarithmic or polynomial factor appearing.

Authors: The complete proofs appear in Appendices B and C. The link function is assumed twice continuously differentiable with bounded second derivative; batch sizes follow the schedule b_m = 2^m with m up to log T; and the pilot direction must satisfy ||pilot - true|| = O(T^{-1/4}) to eliminate extra factors and recover the d=1 minimax rate. We will insert a concise statement of these conditions together with a reference to the appendix at the beginning of the regret-bounds section in the revision. revision: yes
Referee: [Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.

Authors: The abstract and introduction already state that the d=1 rate holds only under the stated pilot-accuracy and fixed-K conditions. When the pilot error exceeds the threshold the effective dimension rises and the rate reverts to the nonparametric d-dimensional bound; when K grows, an extra poly(K) factor appears. A full quantitative extension to these regimes lies outside the present scope and would essentially reproduce the Jiang et al. (2025) analysis. We will add one paragraph in the discussion section clarifying these scope limitations. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper conditions its minimax rate claim explicitly on the single-index model holding and on a pilot direction supplied with sufficient accuracy (K fixed). Under those conditions the reduction to effective dimension 1 follows from standard 1-D nonparametric rates once binning aligns with the known index; no equation reduces a claimed prediction to a fitted quantity defined by the same data, and no load-bearing step relies on self-citation for uniqueness, ansatz, or model justification. The estimation-of-direction case is treated separately with its own bounds. The derivation is therefore self-contained against external nonparametric benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the single-index model being correctly specified and on the pilot direction being sufficiently accurate; these are domain assumptions rather than free parameters or new entities.

axioms (2)

domain assumption Single-index regression model holds for the reward functions across arms
Abstract states the framework 'leverages the single-index regression (SIR) model to capture relationships between arm rewards'
domain assumption Pilot direction is available with sufficient accuracy when claimed
Regret optimality is conditioned on 'a pilot direction is available with sufficient accuracy'

pith-pipeline@v0.9.0 · 5790 in / 1334 out tokens · 23750 ms · 2026-05-23T01:24:24.047702+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

[1]

Abbasi-Yadkori, D

Y. Abbasi-Yadkori, D. P ´al, and C. Szepesv ´ari, Improved algorithms for linear stochastic bandits , in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., vol. 24, Curran Associates, Inc., 2011

work page 2011
[2]

Agrawal and N

S. Agrawal and N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 127–135

work page 2013
[3]

Arya and B

S. Arya and B. K. Sriperumbudur , Kernel ϵ-greedy for contextual bandits , arXiv preprint arXiv:2306.17329, (2023)

work page arXiv 2023
[4]

P. M. Asquith and H. Ihshaish , Classification of eye-state using eeg recordings: speed-up gains using signal epochs and mutual information measure , in Proceedings of the 23rd International Database Applications & Engineering Symposium, 2019, pp. 1–6

work page 2019
[5]

O. Atan, C. Tekin, and M. Van der Schaar , Global multi-armed bandits with H¨ older continuity, in Artificial Intelligence and Statistics, PMLR, 2015, pp. 28–36

work page 2015
[6]

O. Atan, C. Tekin, and M. van der Schaar , Global bandits, IEEE Transactions on Neural Networks and Learning Systems, 29 (2018), pp. 5798–5811

work page 2018
[7]

Babichev and F

D. Babichev and F. Bach, Slice inverse regression with score functions, Electronic Journal of Statistics, 12 (2018), pp. 1507 – 1543

work page 2018
[8]

Bastani and M

H. Bastani and M. Bayati, Online decision making with high-dimensional covariates , Operations Re- search, 68 (2020), pp. 276–294

work page 2020
[9]

Bietti, A

A. Bietti, A. Agarwal, and J. Langford, A contextual bandit bake-off, Journal of Machine Learning Research, 22 (2021), pp. 1–49

work page 2021
[10]

T. T. Cai and H. Pu , Stochastic continuum-armed bandits with additive models: Minimax regrets and adaptive algorithm, The Annals of Statistics, 50 (2022), pp. 2179–2204

work page 2022
[11]

Z. Cai, R. Li, and L. Zhu , Online sufficient dimension reduction through sliced inverse regression , Journal of Machine Learning Research, 21 (2020), pp. 1–25

work page 2020
[12]

Candanedo , Occupancy Detection

L. Candanedo , Occupancy Detection . UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C5X01N

work page doi:10.24432/c5x01n 2016
[13]

S. R. Chowdhury and A. Gopalan , On kernelized multi-armed bandits , in International Conference on Machine Learning, PMLR, 2017, pp. 844–853

work page 2017
[14]

W. Chu, L. Li, L. Reyzin, and R. Schapire , Contextual bandits with linear payoff functions , in Pro- ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gor- don, D. Dunson, and M. Dudik, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 208–214

work page 2011
[15]

Cinar and M

I. Cinar and M. Koklu, Rice (Cammeo and Osmancik). UCI Machine Learning Repository, 2019. DOI: https://doi.org/10.24432/C5MW4Z

work page doi:10.24432/c5mw4z 2019
[16]

Cinarer, N

G. Cinarer, N. Erbas ¸, and A. ¨Ocal, Rice classification and quality detection success with artificial intelligence technologies, Brazilian Archives of Biology and Technology, (2024)

work page 2024
[17]

R. Dai, H. Song, R. F. Barber, and G. Raskutti , Convergence guarantee for the sparse monotone single index model , Electronic Journal of Statistics, 16 (2022), pp. 4449–4496

work page 2022
[18]

Esfandiari, A

H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni , Regret bounds for batched bandits , Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), pp. 7340–7348

work page 2021
[19]

Y. Feng, Z. Huang, and T. Wang , Lipschitz bandits with batched feedback , in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds., 2022

work page 2022
[20]

Filippi, O

S. Filippi, O. Cappe, A. Garivier, and C. Szepesv ´ari, Parametric Bandits: The generalized linear case, in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds., vol. 23, Curran Associates, Inc., 2010

work page 2010
[21]

Ghosh, S

A. Ghosh, S. R. Chowdhury, and A. Gopalan , Misspecified linear bandits, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, 2017

work page 2017
[22]

Goldenshluger and A

A. Goldenshluger and A. Zeevi , A linear response bandit problem , Stochastic Systems, 3 (2013), pp. 230–261

work page 2013
[23]

Greenewald, A

K. Greenewald, A. Tewari, S. Murphy, and P. Klasnja , Action centered contextual bandits , in SM37 SAKSHI ARYA AND HYEBIN SONG Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017

work page 2017
[24]

Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou , Batched neural bandits, ACM / IMS J. Data Sci., 1 (2024)

work page 2024
[25]

Gupta, S

S. Gupta, S. Chaudhari, G. Joshi, and O. Ya ˘gan, Multi-armed bandits with correlated arms , IEEE Transactions on Information Theory, 67 (2021), pp. 6711–6732

work page 2021
[26]

Y. Gur, A. Momeni, and S. Wager , Smoothness-adaptive contextual bandits, Operations Research, 70 (2022), pp. 3198–3216

work page 2022
[27]

Y. Han, Z. Zhou, Z. Zhou, J. Blanchet, P. W. Glynn, and Y. Ye , Sequential batch learning in finite-action linear contextual bandits, arXiv preprint arXiv:2004.06321, (2020)

work page arXiv 2004
[28]

Hardle, P

W. Hardle, P. Hall, and H. Ichimura , Optimal smoothing in single-index models , The Annals of Statistics, 21 (1993), pp. 157–178

work page 1993
[29]

Y. Hu, N. Kallus, and X. Mao , Smooth contextual bandits: Bridging the parametric and non- differentiable regret regimes, in Conference on Learning Theory, PMLR, 2020, pp. 2007–2010

work page 2020
[30]

Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models , Journal of econometrics, 58 (1993), pp

H. Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models , Journal of econometrics, 58 (1993), pp. 71–120

work page 1993
[31]

Jiang, Non-asymptotic uniform rates of consistency for k-nn regression , in Proceedings of the AAAI Conference on Artificial Intelligence, vol

H. Jiang, Non-asymptotic uniform rates of consistency for k-nn regression , in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3999–4006

work page 2019
[32]

Jiang and C

R. Jiang and C. Ma , Batched nonparametric contextual bandits , arXiv preprint arXiv:2402.17732, (2024)

work page arXiv 2024
[33]

T. Jin, J. Tang, P. Xu, K. Huang, X. Xiao, and Q. Gu, Almost optimal anytime algorithm for batched multi-armed bandits, in International Conference on Machine Learning, PMLR, 2021, pp. 5065–5073

work page 2021
[34]

Kalkanli and A

C. Kalkanli and A. Ozgur, Batched Thompson sampling, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 29984–29994

work page 2021
[35]

Kandasamy, J

K. Kandasamy, J. Schneider, and B. Poczos, High dimensional bayesian optimisation and bandits via additive models, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 07–09 Jul 2015, PMLR, pp. 295–304, https://proceedings.mlr.press/v37/kanda...

work page 2015
[36]

G. H. Khan and M. A. Rahman , Room occupancy detection from temperature, light, humidity, and carbon dioxide measurements using deep learning , in 2021 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 2021, pp. 1–4

work page 2021
[37]

Kim and M

G.-S. Kim and M. C. Paik , Contextual multi-armed bandit algorithm for semiparametric reward model , in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 3389–3397

work page 2019
[38]

Krishnamurthy, Z

A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, Semiparametric contextual bandits, in International Conference on Machine Learning, PMLR, 2018, pp. 2776–2785

work page 2018
[39]

A. K. Kuchibhotla and R. K. Patra , Efficient estimation in single index models through smoothing splines, Bernoulli, 26 (2020), pp. 1587–1618

work page 2020
[40]

Kuszmaul and Q

W. Kuszmaul and Q. Qi , The multiplicative version of azuma’s inequality, with an application to contention analysis, arXiv preprint arXiv:2102.05077, (2021)

work page arXiv 2021
[41]

T. L. Lai , Adaptive treatment allocation and the multi-armed bandit problem , The Annals of Statistics, (1987), pp. 1091–1114

work page 1987
[42]

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules , Advances in applied math- ematics, 6 (1985), pp. 4–22

work page 1985
[43]

K. Li, Y. Yang, and N. N. Narisetty, Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit, Electronic Journal of Statistics, 15 (2021), pp. 5652–5695

work page 2021
[44]

Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Associa- tion, 86 (1991), pp

K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Associa- tion, 86 (1991), pp. 316–327

work page 1991
[45]

Li and N

K.-C. Li and N. Duan , Regression analysis under link violation , The Annals of Statistics, (1989), pp. 1009–1052

work page 1989
[46]

W. Li, A. Barik, and J. Honorio , A simple unified framework for high dimensional bandit problems , in International Conference on Machine Learning, PMLR, 2022, pp. 12619–12655. SM38 SEMI-PARAMETRIC BATCHED GLOBAL MULTI-ARMED BANDITS WITH COVARIATES

work page 2022
[47]

W. Li, N. Chen, and L. J. Hong , Dimension reduction in contextual online learning via nonparametric variable selection, Journal of Machine Learning Research, 24 (2023), pp. 1–84

work page 2023
[48]

W. K. Newey and T. M. Stoker, Efficiency of weighted average derivative estimators and index models, Econometrica: Journal of the Econometric Society, (1993), pp. 1199–1223

work page 1993
[49]

Perchet and P

V. Perchet and P. Rigollet, The multi-armed bandit problem with covariates, The Annals of Statistics, (2013)

work page 2013
[50]

Perchet, P

V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg , Batched bandit problems, The Annals of Statistics, 44 (2016), pp. 660 – 681

work page 2016
[51]

Qian, C.-K

W. Qian, C.-K. Ing, and J. Liu , Adaptive algorithm for multi-armed bandit problem with high- dimensional covariates, Journal of the American Statistical Association, 119 (2024), pp. 970–982

work page 2024
[52]

Qian and Y

W. Qian and Y. Yang , Kernel estimation and model combination in a bandit problem with covariates , Journal of Machine Learning Research, 17 (2016)

work page 2016
[53]

Z. Ren, Z. Zhou, and J. R. Kalagnanam , Batched learning in generalized linear contextual bandits with general decision sets , IEEE Control Systems Letters, 6 (2022), pp. 37–42

work page 2022
[54]

Rigollet and A

P. Rigollet and A. Zeevi , Nonparametric bandits with covariates , Conference on Learning Theory (COLT), (2010), p. 54

work page 2010
[55]

Roesler , EEG Eye State

O. Roesler , EEG Eye State . UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C57G7J

work page doi:10.24432/c57g7j 2013
[56]

R ¨osler and D

O. R ¨osler and D. Suendermann , A first step towards eye state prediction using eeg , Proc. of the AIHLS, 1 (2013), pp. 1–4

work page 2013
[57]

C. Shen, R. Zhou, C. Tekin, and M. van der Schaar , Generalized global bandit and its application in cellular coverage optimization , IEEE Journal of Selected Topics in Signal Processing, 12 (2018), pp. 218–232

work page 2018
[58]

C. Shi, C. Shen, and J. Yang , Federated multi-armed bandits with personalization , in International conference on artificial intelligence and statistics, PMLR, 2021, pp. 2917–2925

work page 2021
[59]

Tsybakov, Introduction to Nonparametric Estimation , Springer Series in Statistics, Springer New York, 2008

A. Tsybakov, Introduction to Nonparametric Estimation , Springer Series in Statistics, Springer New York, 2008

work page 2008
[60]

Valko, N

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, Finite-time analysis of kernelised contextual bandits, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intel- ligence, 2013, pp. 654–663

work page 2013
[61]

Van Parys and N

B. Van Parys and N. Golrezaei , Optimal learning for structured bandits , Management Science, 70 (2024), pp. 3951–3998

work page 2024
[62]

Wanigasekara and C

N. Wanigasekara and C. Yu, Nonparametric contextual bandits in metric spaces with unknown metric , in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, eds., vol. 32, Curran Associates, Inc., 2019

work page 2019
[63]

W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu , Multi-armed bandit-based client scheduling for federated learning , IEEE Transactions on Wireless Communications, 19 (2020), pp. 7108–7123

work page 2020
[64]

Yang and D

Y. Yang and D. Zhu , Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates, The Annals of Statistics, 30 (2002), pp. 100–121

work page 2002
[65]

Y. Yu, T. Wang, and R. J. Samworth, A useful variant of the Davis–Kahan theorem for statisticians , Biometrika, 102 (2015), pp. 315–323

work page 2015
[66]

D. Zhou, L. Li, and Q. Gu , Neural contextual bandits with UCB-based exploration , in International Conference on Machine Learning, PMLR, 2020, pp. 11492–11502

work page 2020
[67]

Y. Zhu, D. Zhou, R. Jiang, Q. Gu, R. Willett, and R. Nowak , Pure exploration in kernel and neural bandits, Advances in neural information processing systems, 34 (2021), pp. 11618–11630. SM39

work page 2021

[1] [1]

Abbasi-Yadkori, D

Y. Abbasi-Yadkori, D. P ´al, and C. Szepesv ´ari, Improved algorithms for linear stochastic bandits , in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., vol. 24, Curran Associates, Inc., 2011

work page 2011

[2] [2]

Agrawal and N

S. Agrawal and N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 127–135

work page 2013

[3] [3]

Arya and B

S. Arya and B. K. Sriperumbudur , Kernel ϵ-greedy for contextual bandits , arXiv preprint arXiv:2306.17329, (2023)

work page arXiv 2023

[4] [4]

P. M. Asquith and H. Ihshaish , Classification of eye-state using eeg recordings: speed-up gains using signal epochs and mutual information measure , in Proceedings of the 23rd International Database Applications & Engineering Symposium, 2019, pp. 1–6

work page 2019

[5] [5]

O. Atan, C. Tekin, and M. Van der Schaar , Global multi-armed bandits with H¨ older continuity, in Artificial Intelligence and Statistics, PMLR, 2015, pp. 28–36

work page 2015

[6] [6]

O. Atan, C. Tekin, and M. van der Schaar , Global bandits, IEEE Transactions on Neural Networks and Learning Systems, 29 (2018), pp. 5798–5811

work page 2018

[7] [7]

Babichev and F

D. Babichev and F. Bach, Slice inverse regression with score functions, Electronic Journal of Statistics, 12 (2018), pp. 1507 – 1543

work page 2018

[8] [8]

Bastani and M

H. Bastani and M. Bayati, Online decision making with high-dimensional covariates , Operations Re- search, 68 (2020), pp. 276–294

work page 2020

[9] [9]

Bietti, A

A. Bietti, A. Agarwal, and J. Langford, A contextual bandit bake-off, Journal of Machine Learning Research, 22 (2021), pp. 1–49

work page 2021

[10] [10]

T. T. Cai and H. Pu , Stochastic continuum-armed bandits with additive models: Minimax regrets and adaptive algorithm, The Annals of Statistics, 50 (2022), pp. 2179–2204

work page 2022

[11] [11]

Z. Cai, R. Li, and L. Zhu , Online sufficient dimension reduction through sliced inverse regression , Journal of Machine Learning Research, 21 (2020), pp. 1–25

work page 2020

[12] [12]

Candanedo , Occupancy Detection

L. Candanedo , Occupancy Detection . UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C5X01N

work page doi:10.24432/c5x01n 2016

[13] [13]

S. R. Chowdhury and A. Gopalan , On kernelized multi-armed bandits , in International Conference on Machine Learning, PMLR, 2017, pp. 844–853

work page 2017

[14] [14]

W. Chu, L. Li, L. Reyzin, and R. Schapire , Contextual bandits with linear payoff functions , in Pro- ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gor- don, D. Dunson, and M. Dudik, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 208–214

work page 2011

[15] [15]

Cinar and M

I. Cinar and M. Koklu, Rice (Cammeo and Osmancik). UCI Machine Learning Repository, 2019. DOI: https://doi.org/10.24432/C5MW4Z

work page doi:10.24432/c5mw4z 2019

[16] [16]

Cinarer, N

G. Cinarer, N. Erbas ¸, and A. ¨Ocal, Rice classification and quality detection success with artificial intelligence technologies, Brazilian Archives of Biology and Technology, (2024)

work page 2024

[17] [17]

R. Dai, H. Song, R. F. Barber, and G. Raskutti , Convergence guarantee for the sparse monotone single index model , Electronic Journal of Statistics, 16 (2022), pp. 4449–4496

work page 2022

[18] [18]

Esfandiari, A

H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni , Regret bounds for batched bandits , Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), pp. 7340–7348

work page 2021

[19] [19]

Y. Feng, Z. Huang, and T. Wang , Lipschitz bandits with batched feedback , in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds., 2022

work page 2022

[20] [20]

Filippi, O

S. Filippi, O. Cappe, A. Garivier, and C. Szepesv ´ari, Parametric Bandits: The generalized linear case, in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds., vol. 23, Curran Associates, Inc., 2010

work page 2010

[21] [21]

Ghosh, S

A. Ghosh, S. R. Chowdhury, and A. Gopalan , Misspecified linear bandits, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, 2017

work page 2017

[22] [22]

Goldenshluger and A

A. Goldenshluger and A. Zeevi , A linear response bandit problem , Stochastic Systems, 3 (2013), pp. 230–261

work page 2013

[23] [23]

Greenewald, A

K. Greenewald, A. Tewari, S. Murphy, and P. Klasnja , Action centered contextual bandits , in SM37 SAKSHI ARYA AND HYEBIN SONG Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017

work page 2017

[24] [24]

Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou , Batched neural bandits, ACM / IMS J. Data Sci., 1 (2024)

work page 2024

[25] [25]

Gupta, S

S. Gupta, S. Chaudhari, G. Joshi, and O. Ya ˘gan, Multi-armed bandits with correlated arms , IEEE Transactions on Information Theory, 67 (2021), pp. 6711–6732

work page 2021

[26] [26]

Y. Gur, A. Momeni, and S. Wager , Smoothness-adaptive contextual bandits, Operations Research, 70 (2022), pp. 3198–3216

work page 2022

[27] [27]

Y. Han, Z. Zhou, Z. Zhou, J. Blanchet, P. W. Glynn, and Y. Ye , Sequential batch learning in finite-action linear contextual bandits, arXiv preprint arXiv:2004.06321, (2020)

work page arXiv 2004

[28] [28]

Hardle, P

W. Hardle, P. Hall, and H. Ichimura , Optimal smoothing in single-index models , The Annals of Statistics, 21 (1993), pp. 157–178

work page 1993

[29] [29]

Y. Hu, N. Kallus, and X. Mao , Smooth contextual bandits: Bridging the parametric and non- differentiable regret regimes, in Conference on Learning Theory, PMLR, 2020, pp. 2007–2010

work page 2020

[30] [30]

Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models , Journal of econometrics, 58 (1993), pp

H. Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models , Journal of econometrics, 58 (1993), pp. 71–120

work page 1993

[31] [31]

Jiang, Non-asymptotic uniform rates of consistency for k-nn regression , in Proceedings of the AAAI Conference on Artificial Intelligence, vol

H. Jiang, Non-asymptotic uniform rates of consistency for k-nn regression , in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3999–4006

work page 2019

[32] [32]

Jiang and C

R. Jiang and C. Ma , Batched nonparametric contextual bandits , arXiv preprint arXiv:2402.17732, (2024)

work page arXiv 2024

[33] [33]

T. Jin, J. Tang, P. Xu, K. Huang, X. Xiao, and Q. Gu, Almost optimal anytime algorithm for batched multi-armed bandits, in International Conference on Machine Learning, PMLR, 2021, pp. 5065–5073

work page 2021

[34] [34]

Kalkanli and A

C. Kalkanli and A. Ozgur, Batched Thompson sampling, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 29984–29994

work page 2021

[35] [35]

Kandasamy, J

K. Kandasamy, J. Schneider, and B. Poczos, High dimensional bayesian optimisation and bandits via additive models, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 07–09 Jul 2015, PMLR, pp. 295–304, https://proceedings.mlr.press/v37/kanda...

work page 2015

[36] [36]

G. H. Khan and M. A. Rahman , Room occupancy detection from temperature, light, humidity, and carbon dioxide measurements using deep learning , in 2021 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 2021, pp. 1–4

work page 2021

[37] [37]

Kim and M

G.-S. Kim and M. C. Paik , Contextual multi-armed bandit algorithm for semiparametric reward model , in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 3389–3397

work page 2019

[38] [38]

Krishnamurthy, Z

A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, Semiparametric contextual bandits, in International Conference on Machine Learning, PMLR, 2018, pp. 2776–2785

work page 2018

[39] [39]

A. K. Kuchibhotla and R. K. Patra , Efficient estimation in single index models through smoothing splines, Bernoulli, 26 (2020), pp. 1587–1618

work page 2020

[40] [40]

Kuszmaul and Q

W. Kuszmaul and Q. Qi , The multiplicative version of azuma’s inequality, with an application to contention analysis, arXiv preprint arXiv:2102.05077, (2021)

work page arXiv 2021

[41] [41]

T. L. Lai , Adaptive treatment allocation and the multi-armed bandit problem , The Annals of Statistics, (1987), pp. 1091–1114

work page 1987

[42] [42]

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules , Advances in applied math- ematics, 6 (1985), pp. 4–22

work page 1985

[43] [43]

K. Li, Y. Yang, and N. N. Narisetty, Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit, Electronic Journal of Statistics, 15 (2021), pp. 5652–5695

work page 2021

[44] [44]

Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Associa- tion, 86 (1991), pp

K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Associa- tion, 86 (1991), pp. 316–327

work page 1991

[45] [45]

Li and N

K.-C. Li and N. Duan , Regression analysis under link violation , The Annals of Statistics, (1989), pp. 1009–1052

work page 1989

[46] [46]

W. Li, A. Barik, and J. Honorio , A simple unified framework for high dimensional bandit problems , in International Conference on Machine Learning, PMLR, 2022, pp. 12619–12655. SM38 SEMI-PARAMETRIC BATCHED GLOBAL MULTI-ARMED BANDITS WITH COVARIATES

work page 2022

[47] [47]

W. Li, N. Chen, and L. J. Hong , Dimension reduction in contextual online learning via nonparametric variable selection, Journal of Machine Learning Research, 24 (2023), pp. 1–84

work page 2023

[48] [48]

W. K. Newey and T. M. Stoker, Efficiency of weighted average derivative estimators and index models, Econometrica: Journal of the Econometric Society, (1993), pp. 1199–1223

work page 1993

[49] [49]

Perchet and P

V. Perchet and P. Rigollet, The multi-armed bandit problem with covariates, The Annals of Statistics, (2013)

work page 2013

[50] [50]

Perchet, P

V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg , Batched bandit problems, The Annals of Statistics, 44 (2016), pp. 660 – 681

work page 2016

[51] [51]

Qian, C.-K

W. Qian, C.-K. Ing, and J. Liu , Adaptive algorithm for multi-armed bandit problem with high- dimensional covariates, Journal of the American Statistical Association, 119 (2024), pp. 970–982

work page 2024

[52] [52]

Qian and Y

W. Qian and Y. Yang , Kernel estimation and model combination in a bandit problem with covariates , Journal of Machine Learning Research, 17 (2016)

work page 2016

[53] [53]

Z. Ren, Z. Zhou, and J. R. Kalagnanam , Batched learning in generalized linear contextual bandits with general decision sets , IEEE Control Systems Letters, 6 (2022), pp. 37–42

work page 2022

[54] [54]

Rigollet and A

P. Rigollet and A. Zeevi , Nonparametric bandits with covariates , Conference on Learning Theory (COLT), (2010), p. 54

work page 2010

[55] [55]

Roesler , EEG Eye State

O. Roesler , EEG Eye State . UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C57G7J

work page doi:10.24432/c57g7j 2013

[56] [56]

R ¨osler and D

O. R ¨osler and D. Suendermann , A first step towards eye state prediction using eeg , Proc. of the AIHLS, 1 (2013), pp. 1–4

work page 2013

[57] [57]

C. Shen, R. Zhou, C. Tekin, and M. van der Schaar , Generalized global bandit and its application in cellular coverage optimization , IEEE Journal of Selected Topics in Signal Processing, 12 (2018), pp. 218–232

work page 2018

[58] [58]

C. Shi, C. Shen, and J. Yang , Federated multi-armed bandits with personalization , in International conference on artificial intelligence and statistics, PMLR, 2021, pp. 2917–2925

work page 2021

[59] [59]

Tsybakov, Introduction to Nonparametric Estimation , Springer Series in Statistics, Springer New York, 2008

A. Tsybakov, Introduction to Nonparametric Estimation , Springer Series in Statistics, Springer New York, 2008

work page 2008

[60] [60]

Valko, N

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini, Finite-time analysis of kernelised contextual bandits, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intel- ligence, 2013, pp. 654–663

work page 2013

[61] [61]

Van Parys and N

B. Van Parys and N. Golrezaei , Optimal learning for structured bandits , Management Science, 70 (2024), pp. 3951–3998

work page 2024

[62] [62]

Wanigasekara and C

N. Wanigasekara and C. Yu, Nonparametric contextual bandits in metric spaces with unknown metric , in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, eds., vol. 32, Curran Associates, Inc., 2019

work page 2019

[63] [63]

W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu , Multi-armed bandit-based client scheduling for federated learning , IEEE Transactions on Wireless Communications, 19 (2020), pp. 7108–7123

work page 2020

[64] [64]

Yang and D

Y. Yang and D. Zhu , Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates, The Annals of Statistics, 30 (2002), pp. 100–121

work page 2002

[65] [65]

Y. Yu, T. Wang, and R. J. Samworth, A useful variant of the Davis–Kahan theorem for statisticians , Biometrika, 102 (2015), pp. 315–323

work page 2015

[66] [66]

D. Zhou, L. Li, and Q. Gu , Neural contextual bandits with UCB-based exploration , in International Conference on Machine Learning, PMLR, 2020, pp. 11492–11502

work page 2020

[67] [67]

Y. Zhu, D. Zhou, R. Jiang, Q. Gu, R. Willett, and R. Nowak , Pure exploration in kernel and neural bandits, Advances in neural information processing systems, 34 (2021), pp. 11618–11630. SM39

work page 2021