Batched Single-Index Global Multi-Armed Bandits with Covariates
Pith reviewed 2026-05-23 01:24 UTC · model grok-4.3
The pith
Single-index regression lets batched bandits with covariates achieve optimal one-dimensional regret rates when a pilot direction is given accurately.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The BIDS algorithm, which pairs batched successive arm elimination with dynamic binning guided by the single-index direction, attains minimax-optimal rates (equivalent to d=1) for nonparametric batched bandits whenever a pilot direction of sufficient accuracy is available and the number of arms K is fixed.
What carries the argument
The single-index regression model relating rewards to covariates via an unknown link function of a linear index, together with the dynamic binning mechanism inside the BIDS batched elimination procedure.
If this is right
- Regret bounds are derived both when a pilot direction is supplied and when the direction must be estimated from data.
- The method circumvents the curse of dimensionality for fixed K by reducing the effective dimension to one.
- Extensive simulations and real-data experiments show lower regret than the nonparametric batched bandit baseline.
Where Pith is reading between the lines
- The framework suggests that single-index models can serve as a practical compromise between linear and fully nonparametric contextual bandits under batch feedback.
- If the direction can be estimated adaptively at negligible extra cost, the same one-dimensional rates may hold without an external pilot.
- The approach may extend to other batch sequential decision problems where a low-dimensional structure is known to be present but not fully linear.
Load-bearing premise
The true relationship between covariates and rewards is correctly described by a single-index model shared across arms.
What would settle it
A numerical experiment in which an accurate pilot direction is supplied yet the observed regret rate stays as slow as the full nonparametric d-dimensional rate would falsify the optimality claim.
Figures
read the original abstract
The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications, such as personalized medicine and recommendation systems, contextual information is available at the time of decision-making, rewards from different arms are related rather than independent, and feedback is provided in batches. We propose a novel semi-parametric framework for batched bandits with covariates that incorporates a shared parameter across arms. We leverage the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy and the number of arms $K$ is fixed, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by \cite{jiang2025batched}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the BIDS algorithm for batched global multi-armed bandits with covariates under a single-index regression model. It considers two settings (pilot direction supplied with sufficient accuracy; direction estimated from data), derives theoretical regret bounds for both, and claims that with fixed K and an accurate pilot the method attains minimax-optimal nonparametric rates for the effective dimension d=1, thereby circumventing the curse of dimensionality. Experiments on simulated and real data compare BIDS to the nonparametric batched bandit baseline of Jiang et al. (2025).
Significance. If the stated regret bounds hold under the single-index model and the pilot-accuracy condition, the work supplies a concrete semi-parametric route to dimension-free rates in batched contextual bandits. This is potentially useful for applications such as personalized medicine where high-dimensional covariates are present but a low-dimensional index structure may be plausible. The explicit separation of the pilot and estimation cases, together with the dynamic binning mechanism aligned to the index, is a clear technical contribution when the modeling assumptions are met.
major comments (2)
- [Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained rather than an additional logarithmic or polynomial factor appearing.
- [Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.
minor comments (2)
- [Notation] Notation for the single-index direction and the binning grid is introduced without a consolidated table of symbols; readers must hunt through the text to recover definitions.
- [Experiments] The experiments section reports performance on real-world data but does not state the dimension d of the covariates or the effective sample size per batch, making it hard to judge whether the observed gains are consistent with the d=1 regime.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.
read point-by-point responses
-
Referee: [Theoretical analysis] Theoretical analysis (regret bounds section): the manuscript states regret bounds for both the pilot-direction and estimated-direction cases but does not supply the full derivation or the precise assumptions on the link function, the batch-size schedule, and the required accuracy of the pilot direction. Without these, it is impossible to confirm that the claimed minimax-optimal d=1 rate is attained rather than an additional logarithmic or polynomial factor appearing.
Authors: The complete proofs appear in Appendices B and C. The link function is assumed twice continuously differentiable with bounded second derivative; batch sizes follow the schedule b_m = 2^m with m up to log T; and the pilot direction must satisfy ||pilot - true|| = O(T^{-1/4}) to eliminate extra factors and recover the d=1 minimax rate. We will insert a concise statement of these conditions together with a reference to the appendix at the beginning of the regret-bounds section in the revision. revision: yes
-
Referee: [Pilot-direction setting] § on pilot direction: the optimality claim is explicitly conditional on the pilot direction being supplied with sufficient accuracy and on K being fixed. The manuscript does not quantify the degradation in the regret bound when the pilot error exceeds the stated threshold or when K grows with the horizon, both of which are load-bearing for the central “circumventing the curse of dimensionality” statement.
Authors: The abstract and introduction already state that the d=1 rate holds only under the stated pilot-accuracy and fixed-K conditions. When the pilot error exceeds the threshold the effective dimension rises and the rate reverts to the nonparametric d-dimensional bound; when K grows, an extra poly(K) factor appears. A full quantitative extension to these regimes lies outside the present scope and would essentially reproduce the Jiang et al. (2025) analysis. We will add one paragraph in the discussion section clarifying these scope limitations. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper conditions its minimax rate claim explicitly on the single-index model holding and on a pilot direction supplied with sufficient accuracy (K fixed). Under those conditions the reduction to effective dimension 1 follows from standard 1-D nonparametric rates once binning aligns with the known index; no equation reduces a claimed prediction to a fitted quantity defined by the same data, and no load-bearing step relies on self-citation for uniqueness, ansatz, or model justification. The estimation-of-direction case is treated separately with its own bounds. The derivation is therefore self-contained against external nonparametric benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Single-index regression model holds for the reward functions across arms
- domain assumption Pilot direction is available with sufficient accuracy when claimed
Reference graph
Works this paper leans on
-
[1]
Y. Abbasi-Yadkori, D. P ´al, and C. Szepesv ´ari, Improved algorithms for linear stochastic bandits , in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., vol. 24, Curran Associates, Inc., 2011
work page 2011
-
[2]
S. Agrawal and N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 127–135
work page 2013
-
[3]
S. Arya and B. K. Sriperumbudur , Kernel ϵ-greedy for contextual bandits , arXiv preprint arXiv:2306.17329, (2023)
-
[4]
P. M. Asquith and H. Ihshaish , Classification of eye-state using eeg recordings: speed-up gains using signal epochs and mutual information measure , in Proceedings of the 23rd International Database Applications & Engineering Symposium, 2019, pp. 1–6
work page 2019
-
[5]
O. Atan, C. Tekin, and M. Van der Schaar , Global multi-armed bandits with H¨ older continuity, in Artificial Intelligence and Statistics, PMLR, 2015, pp. 28–36
work page 2015
-
[6]
O. Atan, C. Tekin, and M. van der Schaar , Global bandits, IEEE Transactions on Neural Networks and Learning Systems, 29 (2018), pp. 5798–5811
work page 2018
-
[7]
D. Babichev and F. Bach, Slice inverse regression with score functions, Electronic Journal of Statistics, 12 (2018), pp. 1507 – 1543
work page 2018
-
[8]
H. Bastani and M. Bayati, Online decision making with high-dimensional covariates , Operations Re- search, 68 (2020), pp. 276–294
work page 2020
- [9]
-
[10]
T. T. Cai and H. Pu , Stochastic continuum-armed bandits with additive models: Minimax regrets and adaptive algorithm, The Annals of Statistics, 50 (2022), pp. 2179–2204
work page 2022
-
[11]
Z. Cai, R. Li, and L. Zhu , Online sufficient dimension reduction through sliced inverse regression , Journal of Machine Learning Research, 21 (2020), pp. 1–25
work page 2020
-
[12]
Candanedo , Occupancy Detection
L. Candanedo , Occupancy Detection . UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C5X01N
-
[13]
S. R. Chowdhury and A. Gopalan , On kernelized multi-armed bandits , in International Conference on Machine Learning, PMLR, 2017, pp. 844–853
work page 2017
-
[14]
W. Chu, L. Li, L. Reyzin, and R. Schapire , Contextual bandits with linear payoff functions , in Pro- ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gor- don, D. Dunson, and M. Dudik, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 208–214
work page 2011
-
[15]
I. Cinar and M. Koklu, Rice (Cammeo and Osmancik). UCI Machine Learning Repository, 2019. DOI: https://doi.org/10.24432/C5MW4Z
-
[16]
G. Cinarer, N. Erbas ¸, and A. ¨Ocal, Rice classification and quality detection success with artificial intelligence technologies, Brazilian Archives of Biology and Technology, (2024)
work page 2024
-
[17]
R. Dai, H. Song, R. F. Barber, and G. Raskutti , Convergence guarantee for the sparse monotone single index model , Electronic Journal of Statistics, 16 (2022), pp. 4449–4496
work page 2022
-
[18]
H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni , Regret bounds for batched bandits , Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), pp. 7340–7348
work page 2021
-
[19]
Y. Feng, Z. Huang, and T. Wang , Lipschitz bandits with batched feedback , in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds., 2022
work page 2022
-
[20]
S. Filippi, O. Cappe, A. Garivier, and C. Szepesv ´ari, Parametric Bandits: The generalized linear case, in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds., vol. 23, Curran Associates, Inc., 2010
work page 2010
- [21]
-
[22]
A. Goldenshluger and A. Zeevi , A linear response bandit problem , Stochastic Systems, 3 (2013), pp. 230–261
work page 2013
-
[23]
K. Greenewald, A. Tewari, S. Murphy, and P. Klasnja , Action centered contextual bandits , in SM37 SAKSHI ARYA AND HYEBIN SONG Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017
work page 2017
-
[24]
Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou , Batched neural bandits, ACM / IMS J. Data Sci., 1 (2024)
work page 2024
- [25]
-
[26]
Y. Gur, A. Momeni, and S. Wager , Smoothness-adaptive contextual bandits, Operations Research, 70 (2022), pp. 3198–3216
work page 2022
- [27]
- [28]
-
[29]
Y. Hu, N. Kallus, and X. Mao , Smooth contextual bandits: Bridging the parametric and non- differentiable regret regimes, in Conference on Learning Theory, PMLR, 2020, pp. 2007–2010
work page 2020
-
[30]
H. Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models , Journal of econometrics, 58 (1993), pp. 71–120
work page 1993
-
[31]
H. Jiang, Non-asymptotic uniform rates of consistency for k-nn regression , in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3999–4006
work page 2019
-
[32]
R. Jiang and C. Ma , Batched nonparametric contextual bandits , arXiv preprint arXiv:2402.17732, (2024)
-
[33]
T. Jin, J. Tang, P. Xu, K. Huang, X. Xiao, and Q. Gu, Almost optimal anytime algorithm for batched multi-armed bandits, in International Conference on Machine Learning, PMLR, 2021, pp. 5065–5073
work page 2021
-
[34]
C. Kalkanli and A. Ozgur, Batched Thompson sampling, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 29984–29994
work page 2021
-
[35]
K. Kandasamy, J. Schneider, and B. Poczos, High dimensional bayesian optimisation and bandits via additive models, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 07–09 Jul 2015, PMLR, pp. 295–304, https://proceedings.mlr.press/v37/kanda...
work page 2015
-
[36]
G. H. Khan and M. A. Rahman , Room occupancy detection from temperature, light, humidity, and carbon dioxide measurements using deep learning , in 2021 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 2021, pp. 1–4
work page 2021
-
[37]
G.-S. Kim and M. C. Paik , Contextual multi-armed bandit algorithm for semiparametric reward model , in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 3389–3397
work page 2019
-
[38]
A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, Semiparametric contextual bandits, in International Conference on Machine Learning, PMLR, 2018, pp. 2776–2785
work page 2018
-
[39]
A. K. Kuchibhotla and R. K. Patra , Efficient estimation in single index models through smoothing splines, Bernoulli, 26 (2020), pp. 1587–1618
work page 2020
-
[40]
W. Kuszmaul and Q. Qi , The multiplicative version of azuma’s inequality, with an application to contention analysis, arXiv preprint arXiv:2102.05077, (2021)
-
[41]
T. L. Lai , Adaptive treatment allocation and the multi-armed bandit problem , The Annals of Statistics, (1987), pp. 1091–1114
work page 1987
-
[42]
T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules , Advances in applied math- ematics, 6 (1985), pp. 4–22
work page 1985
-
[43]
K. Li, Y. Yang, and N. N. Narisetty, Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit, Electronic Journal of Statistics, 15 (2021), pp. 5652–5695
work page 2021
-
[44]
K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Associa- tion, 86 (1991), pp. 316–327
work page 1991
- [45]
-
[46]
W. Li, A. Barik, and J. Honorio , A simple unified framework for high dimensional bandit problems , in International Conference on Machine Learning, PMLR, 2022, pp. 12619–12655. SM38 SEMI-PARAMETRIC BATCHED GLOBAL MULTI-ARMED BANDITS WITH COVARIATES
work page 2022
-
[47]
W. Li, N. Chen, and L. J. Hong , Dimension reduction in contextual online learning via nonparametric variable selection, Journal of Machine Learning Research, 24 (2023), pp. 1–84
work page 2023
-
[48]
W. K. Newey and T. M. Stoker, Efficiency of weighted average derivative estimators and index models, Econometrica: Journal of the Econometric Society, (1993), pp. 1199–1223
work page 1993
-
[49]
V. Perchet and P. Rigollet, The multi-armed bandit problem with covariates, The Annals of Statistics, (2013)
work page 2013
-
[50]
V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg , Batched bandit problems, The Annals of Statistics, 44 (2016), pp. 660 – 681
work page 2016
-
[51]
W. Qian, C.-K. Ing, and J. Liu , Adaptive algorithm for multi-armed bandit problem with high- dimensional covariates, Journal of the American Statistical Association, 119 (2024), pp. 970–982
work page 2024
-
[52]
W. Qian and Y. Yang , Kernel estimation and model combination in a bandit problem with covariates , Journal of Machine Learning Research, 17 (2016)
work page 2016
-
[53]
Z. Ren, Z. Zhou, and J. R. Kalagnanam , Batched learning in generalized linear contextual bandits with general decision sets , IEEE Control Systems Letters, 6 (2022), pp. 37–42
work page 2022
-
[54]
P. Rigollet and A. Zeevi , Nonparametric bandits with covariates , Conference on Learning Theory (COLT), (2010), p. 54
work page 2010
-
[55]
O. Roesler , EEG Eye State . UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C57G7J
-
[56]
O. R ¨osler and D. Suendermann , A first step towards eye state prediction using eeg , Proc. of the AIHLS, 1 (2013), pp. 1–4
work page 2013
-
[57]
C. Shen, R. Zhou, C. Tekin, and M. van der Schaar , Generalized global bandit and its application in cellular coverage optimization , IEEE Journal of Selected Topics in Signal Processing, 12 (2018), pp. 218–232
work page 2018
-
[58]
C. Shi, C. Shen, and J. Yang , Federated multi-armed bandits with personalization , in International conference on artificial intelligence and statistics, PMLR, 2021, pp. 2917–2925
work page 2021
-
[59]
A. Tsybakov, Introduction to Nonparametric Estimation , Springer Series in Statistics, Springer New York, 2008
work page 2008
- [60]
-
[61]
B. Van Parys and N. Golrezaei , Optimal learning for structured bandits , Management Science, 70 (2024), pp. 3951–3998
work page 2024
-
[62]
N. Wanigasekara and C. Yu, Nonparametric contextual bandits in metric spaces with unknown metric , in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, eds., vol. 32, Curran Associates, Inc., 2019
work page 2019
-
[63]
W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu , Multi-armed bandit-based client scheduling for federated learning , IEEE Transactions on Wireless Communications, 19 (2020), pp. 7108–7123
work page 2020
-
[64]
Y. Yang and D. Zhu , Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates, The Annals of Statistics, 30 (2002), pp. 100–121
work page 2002
-
[65]
Y. Yu, T. Wang, and R. J. Samworth, A useful variant of the Davis–Kahan theorem for statisticians , Biometrika, 102 (2015), pp. 315–323
work page 2015
-
[66]
D. Zhou, L. Li, and Q. Gu , Neural contextual bandits with UCB-based exploration , in International Conference on Machine Learning, PMLR, 2020, pp. 11492–11502
work page 2020
-
[67]
Y. Zhu, D. Zhou, R. Jiang, Q. Gu, R. Willett, and R. Nowak , Pure exploration in kernel and neural bandits, Advances in neural information processing systems, 34 (2021), pp. 11618–11630. SM39
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.