Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
Pith reviewed 2026-05-19 08:45 UTC · model grok-4.3
The pith
When true and predicted rewards are jointly Gaussian, MLA-UCB improves cumulative regret in multi-armed bandits even if surrogate means misalign with true means.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that MLA-UCB provably improves the cumulative regret of multi-armed bandits when predicted and true rewards are jointly Gaussian, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and it achieves asymptotic optimality among a broad class of policies without prior knowledge of the covariance matrix.
What carries the argument
The MLA-UCB algorithm that constructs confidence bounds leveraging machine learning-generated surrogate rewards from side information and historical data.
If this is right
- Regret is reduced compared to classical UCB even with misaligned surrogates.
- The algorithm achieves asymptotic optimality in a broad class of policies.
- It applies to any reward prediction model and form of auxiliary data.
- It extends to batched reward settings with computable bounds for non-Gaussian rewards.
Where Pith is reading between the lines
- This approach could be adapted to contextual bandits or reinforcement learning where auxiliary data is abundant.
- Relaxing the joint Gaussian assumption using other concentration inequalities might broaden applicability to heavy-tailed rewards.
- Testing with larger offline datasets could reveal how surrogate quality scales with correlation strength.
Load-bearing premise
The predicted and true rewards are jointly Gaussian.
What would settle it
A simulation or theoretical counterexample where rewards are not jointly Gaussian, the surrogates are misaligned, and MLA-UCB shows higher regret than standard UCB.
Figures
read the original abstract
Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and achieves the asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MLA-UCB, a variant of UCB for multi-armed bandits that incorporates surrogate rewards produced by pre-trained ML models applied to auxiliary data. Under the assumption that predicted and true rewards are jointly Gaussian, the paper claims that MLA-UCB yields strictly lower cumulative regret than classical UCB even when surrogate means are completely misaligned with true means, requires no knowledge of the covariance matrix, and attains asymptotic optimality within a broad class of policies. The work extends the approach to batched settings with non-Gaussian rewards, supplies computable confidence bounds, and reports regret reductions in simulations and real-world experiments on language-model selection and video recommendation.
Significance. If the central theoretical claims hold, the result would be significant for sequential decision-making: it shows how offline ML predictions can be leveraged to improve online regret without requiring alignment of means or knowledge of the correlation parameter. The extension to batched non-Gaussian rewards and the empirical validation on practical tasks further strengthen applicability. The manuscript supplies explicit regret bounds and extensive simulations, which are positive features.
major comments (2)
- [Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and demonstrate that they remain strictly sub-
- [§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.
minor comments (2)
- [§2] Notation for the surrogate reward process and the definition of 'complete mean misalignment' should be introduced earlier and used consistently throughout the theoretical sections.
- [§5] Figure captions for the simulation results could more explicitly state the correlation values and offline sample sizes used, to allow direct comparison with the theoretical regime.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. The comments highlight important points about the clarity of our theoretical claims and the scope of the batched extension. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and
Authors: We agree that the mechanism by which improvement occurs under joint Gaussianity deserves a more explicit derivation. The covariance-free property means that neither the algorithm nor the stated regret bound requires the user to know or input the covariance value; the analysis instead uses the joint Gaussian structure to bound the conditional variance and the extrapolation bias term without needing its numerical value. The explicit form of the confidence width appears in Equation (8) and the proof of Theorem 3.1, where the width is constructed from the marginal variance of the surrogate plus a controlled bias term that vanishes asymptotically. Because the surrogate is observed at every round, the effective information gain remains strictly larger than the marginal case even under a conservative bound on the correlation. We will add a new subsection in the revision that walks through the construction step by step, explicitly compares the resulting width to the classical UCB width, and shows that the regret improvement is preserved under mean misalignment. revision: yes
-
Referee: [§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.
Authors: The referee correctly notes that the non-Gaussian batched analysis relies on a bounded-bias assumption to control extrapolation error. Under this assumption the regret bound remains strictly better than classical UCB for any fixed batch size, because the surrogate still supplies additional observations per batch. When the surrogate is misspecified, the improvement is preserved as long as the bias remains bounded by a constant that does not grow with the horizon; our simulations in Section 5 already illustrate this regime. To make the boundary conditions transparent, we will insert a remark that states the precise bias condition and supplies a simple counter-example (a deterministic misspecification that makes the bias grow linearly with batch size) under which the improvement can disappear. This will be accompanied by an additional simulation panel showing the transition point. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper derives MLA-UCB regret bounds and asymptotic optimality from the joint Gaussianity assumption on predicted and true rewards, explicitly without requiring covariance knowledge, and claims improvement even under mean misalignment. No load-bearing step reduces by construction to a fitted parameter, self-citation, or input renaming; the central guarantee is presented as following from the distributional assumption and algorithm construction rather than tautologically from data fits or prior author results. The analysis is self-contained against the stated assumptions and does not exhibit self-definitional, fitted-input, or ansatz-smuggling patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Predicted and true rewards are jointly Gaussian
Forward citations
Cited by 1 Pith paper
-
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
Reference graph
Works this paper leans on
-
[1]
Analysis of thompson sampling for the multi-armed bandit problem
Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory , volume 23, pages 39.1–39.26. PMLR, 2012
work page 2012
-
[2]
Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023
work page 2023
-
[3]
PPI++: Efficient Prediction-Powered Inference
Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453 , 2023. 13 0 200 400 600 800 1000 Time Step T 0 10 20 30 40Cumulative Regret MLA-UCB algorithm with ML predictions, = 0 UCB Linear Model SVR Neural Network Decision Tree 0 200 400 600 800 1000 Time Step T 0 10 20 30 40C...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, Inc, 2019
work page 2019
-
[5]
Finite-time analysis of the mul- tiarmed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the mul- tiarmed bandit problem. Machine Learning, 47:235–256, 2002
work page 2002
-
[6]
On multi-armed bandit designs for dose-finding trials
Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding trials. Journal of Machine Learning Research , 22(14):1–38, 2021
work page 2021
-
[7]
Semi-supervised linear regression
David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022
work page 2022
-
[8]
Artificial replay: a meta-algorithm for harnessing historical data in bandits
Siddhartha Banerjee, Sean R Sinclair, Milind Tambe, Lily Xu, and Christina Lee Yu. Artificial replay: a meta-algorithm for harnessing historical data in bandits. arXiv preprint arXiv:2210.00025, 2022
-
[9]
Optimal adaptive policies for se- quential allocation problems
Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for se- quential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996
work page 1996
-
[10]
Measurement error models with auxiliary data
Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies , 72(2):343–366, 2005
work page 2005
-
[11]
Semiparametric efficiency in gmm models with auxiliary data
Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in gmm models with auxiliary data. The Annals of Statistics , 36(2):808, 2008. 14
work page 2008
-
[12]
A unified approach to regression analysis under double- sampling designs
Yi-Hau Chen and Hung Chen. A unified approach to regression analysis under double- sampling designs. Journal of the Royal Statistical Society Series B: Statistical Method- ology, 62(3):449–460, 2000
work page 2000
-
[13]
Leveraging (biased) information: Multi-armed bandits with offline data
Wang Chi Cheung and Lixing Lyu. Leveraging (biased) information: Multi-armed bandits with offline data. arXiv preprint arXiv:2405.02594 , 2024
-
[14]
How the new york times is experimenting with recommendation algo- rithms
Anna Coenen. How the new york times is experimenting with recommendation algo- rithms. NYT Open, 2019. [Last accessed Feb 3, 2025]
work page 2019
-
[15]
Normal bandits of unknown means and variances
Wesley Cowan, Junya Honda, and Michael N Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research , 18(154):1–28, 2018
work page 2018
-
[16]
Prediction de-correlated inference: A safe approach for post-prediction inference
Feng Gan, Wanfeng Liang, and Changliang Zou. Prediction de-correlated inference: A safe approach for post-prediction inference. Australian & New Zealand Journal of Statistics, 66(4):417–440, 2024
work page 2024
-
[17]
Some monotonicity theorems for χ2, f and t distributions with applications
Bhaskar K Ghosh. Some monotonicity theorems for χ2, f and t distributions with applications. Journal of the Royal Statistical Society: Series B (Methodological) , 35(3):480–492, 1973
work page 1973
-
[18]
An introduction to the augmented inverse propensity weighted estimator
Adam N Glynn and Kevin M Quinn. An introduction to the augmented inverse propensity weighted estimator. Political analysis, 18(1):36–56, 2010
work page 2010
-
[19]
Another look at inference after prediction
Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R McCaw, and David Cheng. Another look at inference after prediction. arXiv preprint arXiv:2411.19908 , 2024
-
[20]
Wenlong Ji, Lihua Lei, and Tijana Zrnic. Predictions as surrogates: Revisiting sur- rogate outcomes in the age of ai. arXiv preprint arXiv:2501.09731 , 2025
-
[21]
Asymptotically efficient adaptive allocation rules
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics , 6(1):4–22, 1985
work page 1985
-
[22]
Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 1st edition, 2020
work page 2020
-
[23]
A perspective on the use of control variables to increase the efficiency of monte carlo simulations
Stephen S Lavenberg and Peter D Welch. A perspective on the use of control variables to increase the efficiency of monte carlo simulations. Management Science, 27(3):322– 335, 1981
work page 1981
-
[24]
A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th In- ternational Conference on World Wide Web , 2010
work page 2010
-
[25]
Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990
work page 1990
-
[26]
Inference using surrogate outcome data and a validation sample
Margaret Sullivan Pepe. Inference using surrogate outcome data and a validation sample. Biometrika, 79(2):355–365, 1992
work page 1992
-
[27]
Estimation of regression coefficients when some regressors are not always observed
James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994
work page 1994
-
[28]
Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen
Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning , 11(1):1–96, 2018. 15
work page 2018
-
[29]
Adjusting for nonig- norable drop-out using semiparametric nonresponse models
Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonig- norable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999
work page 1999
-
[30]
Multi-armed bandit problems with history
Pannagadatta Shivaswamy and Thorsten Joachims. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics , pages 1046–1054. PMLR, 2012
work page 2012
-
[31]
Long-term value of exploration: Mea- surements, findings and algorithms
Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, Ed H Chi, Cristos Goodrow, Su-Lin Wu, Lexi Baugher, and Minmin Chen. Long-term value of exploration: Mea- surements, findings and algorithms. Proceedings of the 17th ACM International Con- ference on Web Search and Data Mining , 2024
work page 2024
-
[32]
Stochastic multi-armed bandits with con- trol variates
Arun Verma and Manjesh Kumar Hanawal. Stochastic multi-armed bandits with con- trol variates. Advances in Neural Information Processing Systems , 34:27592–27603, 2021
work page 2021
-
[33]
Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges
Sof´ ıa S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics , 30(2):199, 2015
work page 2015
-
[34]
Semi-supervised inference: General theory and estimation of means
Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics , 47(5):2538–2566, 2019
work page 2019
-
[35]
Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback
Chicheng Zhang, Alekh Agarwal, Hal Daum´ e III, John Langford, and Sahand N Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[36]
Tijana Zrnic and Emmanuel J Cand` es. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024. A Proof This section provides proof of the theoretical results presented in this paper. First, we prove the properties of the machine learning-assisted mean estimator presented in Propo- sition 1 - 3. Proof of Proposition 1. Recall that the definitio...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
We will start with the following generalized version of regret bound. Theorem 2. Under the Gaussian data generation model (2), for any ϵ ∈ (0, 1) and T ≥ 4K, if the sample size of offline predictions satisfies Nk ≥ 1 δk 2 logT log 1 + ∆2 k 24σ2 k (1−ϵ)2 1+ϵ + 3 , ∀k ∈ [K] (25) then the expected regret of Algorithm 1 can be bounded by: E[RegT] ≤ X ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.