pith. sign in

arxiv: 2506.16658 · v2 · submitted 2025-06-20 · 🧮 math.ST · cs.LG· stat.ML· stat.TH

Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Pith reviewed 2026-05-19 08:45 UTC · model grok-4.3

classification 🧮 math.ST cs.LGstat.MLstat.TH
keywords multi-armed banditssurrogate rewardsmachine learningupper confidence boundregret analysisGaussian distributionasymptotic optimalityauxiliary data
0
0 comments X

The pith

When true and predicted rewards are jointly Gaussian, MLA-UCB improves cumulative regret in multi-armed bandits even if surrogate means misalign with true means.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional bandit algorithms suffer from scarce online data, but auxiliary information can be turned into surrogate rewards via pre-trained machine learning models. The paper proposes MLA-UCB, which integrates these surrogates into the decision process for any prediction model. Under joint Gaussianity of predicted and true rewards, it reduces cumulative regret without requiring knowledge of their covariance and remains effective even when surrogate means are completely misaligned. The method also handles batched observations and derives guarantees for non-Gaussian cases, with empirical support from simulations and real-world tasks like language model selection.

Core claim

The central claim is that MLA-UCB provably improves the cumulative regret of multi-armed bandits when predicted and true rewards are jointly Gaussian, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and it achieves asymptotic optimality among a broad class of policies without prior knowledge of the covariance matrix.

What carries the argument

The MLA-UCB algorithm that constructs confidence bounds leveraging machine learning-generated surrogate rewards from side information and historical data.

If this is right

  • Regret is reduced compared to classical UCB even with misaligned surrogates.
  • The algorithm achieves asymptotic optimality in a broad class of policies.
  • It applies to any reward prediction model and form of auxiliary data.
  • It extends to batched reward settings with computable bounds for non-Gaussian rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be adapted to contextual bandits or reinforcement learning where auxiliary data is abundant.
  • Relaxing the joint Gaussian assumption using other concentration inequalities might broaden applicability to heavy-tailed rewards.
  • Testing with larger offline datasets could reveal how surrogate quality scales with correlation strength.

Load-bearing premise

The predicted and true rewards are jointly Gaussian.

What would settle it

A simulation or theoretical counterexample where rewards are not jointly Gaussian, the surrogates are misaligned, and MLA-UCB shows higher regret than standard UCB.

Figures

Figures reproduced from arXiv: 2506.16658 by Lihua Lei, Ruihao Zhu, Wenlong Ji, Yihan Pan.

Figure 1
Figure 1. Figure 1: Comparison of the quantile bound q d(s 2 d−1 − 1) and the true quantile qd  1 2s √ log s  in Proposition 4. We set d = ⌊log s⌋ and vary the values of s to reflect the order of regret bound (1). Theorem 1. Under the Gaussian reward model (2), for any T ≥ 4K, if offline sample size satisfies Nk ≥ 1 δk   2 log T log  1 + ∆2 k 24σ 2 k  + 3   , ∀k ∈ [K], (9) then the expected regret of Algorithm 1 can b… view at source ↗
Figure 2
Figure 2. Figure 2: Cumulative regret of the MLA-UCB algorithm (Algorithm 1) and the classical UCB algorithm) under a Gaussian model. For Figure 2a to 2c, we report the final cumulative regret at T = 1000 steps. For Figure 2d, we choose ρ 2 = 0.5, ∆ = 0.5, N = 100. where x = (x1, x2) ⊤ is the feature vector, wk = (wk,1, wk,2) is the arm-specific weight parameter, ϵ is the random noise. The features are assumed to be only visi… view at source ↗
Figure 3
Figure 3. Figure 3: The conditional expectation and distribution of the true rewards. The true reward is generated using wk,1 = wk,2 = 1 and σ = 0.1. The reward function is highly nonlinear and the reward distribution is different from the Gaussian distribution. From [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cumulative regret of the MLA-UCB algorithm and the classical UCB algorithm under different settings. The true correlations of predictions of each set￾ting are shown in [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and achieves the asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MLA-UCB, a variant of UCB for multi-armed bandits that incorporates surrogate rewards produced by pre-trained ML models applied to auxiliary data. Under the assumption that predicted and true rewards are jointly Gaussian, the paper claims that MLA-UCB yields strictly lower cumulative regret than classical UCB even when surrogate means are completely misaligned with true means, requires no knowledge of the covariance matrix, and attains asymptotic optimality within a broad class of policies. The work extends the approach to batched settings with non-Gaussian rewards, supplies computable confidence bounds, and reports regret reductions in simulations and real-world experiments on language-model selection and video recommendation.

Significance. If the central theoretical claims hold, the result would be significant for sequential decision-making: it shows how offline ML predictions can be leveraged to improve online regret without requiring alignment of means or knowledge of the correlation parameter. The extension to batched non-Gaussian rewards and the empirical validation on practical tasks further strengthen applicability. The manuscript supplies explicit regret bounds and extensive simulations, which are positive features.

major comments (2)
  1. [Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and demonstrate that they remain strictly sub-
  2. [§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.
minor comments (2)
  1. [§2] Notation for the surrogate reward process and the definition of 'complete mean misalignment' should be introduced earlier and used consistently throughout the theoretical sections.
  2. [§5] Figure captions for the simulation results could more explicitly state the correlation values and offline sample sizes used, to allow direct comparison with the theoretical regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The comments highlight important points about the clarity of our theoretical claims and the scope of the batched extension. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and

    Authors: We agree that the mechanism by which improvement occurs under joint Gaussianity deserves a more explicit derivation. The covariance-free property means that neither the algorithm nor the stated regret bound requires the user to know or input the covariance value; the analysis instead uses the joint Gaussian structure to bound the conditional variance and the extrapolation bias term without needing its numerical value. The explicit form of the confidence width appears in Equation (8) and the proof of Theorem 3.1, where the width is constructed from the marginal variance of the surrogate plus a controlled bias term that vanishes asymptotically. Because the surrogate is observed at every round, the effective information gain remains strictly larger than the marginal case even under a conservative bound on the correlation. We will add a new subsection in the revision that walks through the construction step by step, explicitly compares the resulting width to the classical UCB width, and shows that the regret improvement is preserved under mean misalignment. revision: yes

  2. Referee: [§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.

    Authors: The referee correctly notes that the non-Gaussian batched analysis relies on a bounded-bias assumption to control extrapolation error. Under this assumption the regret bound remains strictly better than classical UCB for any fixed batch size, because the surrogate still supplies additional observations per batch. When the surrogate is misspecified, the improvement is preserved as long as the bias remains bounded by a constant that does not grow with the horizon; our simulations in Section 5 already illustrate this regime. To make the boundary conditions transparent, we will insert a remark that states the precise bias condition and supplies a simple counter-example (a deterministic misspecification that makes the bias grow linearly with batch size) under which the improvement can disappear. This will be accompanied by an additional simulation panel showing the transition point. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives MLA-UCB regret bounds and asymptotic optimality from the joint Gaussianity assumption on predicted and true rewards, explicitly without requiring covariance knowledge, and claims improvement even under mean misalignment. No load-bearing step reduces by construction to a fitted parameter, self-citation, or input renaming; the central guarantee is presented as following from the distributional assumption and algorithm construction rather than tautologically from data fits or prior author results. The analysis is self-contained against the stated assumptions and does not exhibit self-definitional, fitted-input, or ansatz-smuggling patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the joint Gaussian assumption for true and surrogate rewards and on the existence of a pre-trained ML model that produces surrogates from auxiliary data; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Predicted and true rewards are jointly Gaussian
    Invoked to obtain closed-form confidence bounds and regret guarantees without requiring the covariance matrix (abstract, theoretical results paragraph).

pith-pipeline@v0.9.0 · 5858 in / 1424 out tokens · 32360 ms · 2026-05-19T08:45:33.726044+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

    cs.LG 2026-05 unverdicted novelty 6.0

    Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Analysis of thompson sampling for the multi-armed bandit problem

    Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory , volume 23, pages 39.1–39.26. PMLR, 2012

  2. [2]

    Prediction-powered inference

    Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023

  3. [3]

    PPI++: Efficient Prediction-Powered Inference

    Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453 , 2023. 13 0 200 400 600 800 1000 Time Step T 0 10 20 30 40Cumulative Regret MLA-UCB algorithm with ML predictions, = 0 UCB Linear Model SVR Neural Network Decision Tree 0 200 400 600 800 1000 Time Step T 0 10 20 30 40C...

  4. [4]

    The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely

    Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, Inc, 2019

  5. [5]

    Finite-time analysis of the mul- tiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the mul- tiarmed bandit problem. Machine Learning, 47:235–256, 2002

  6. [6]

    On multi-armed bandit designs for dose-finding trials

    Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding trials. Journal of Machine Learning Research , 22(14):1–38, 2021

  7. [7]

    Semi-supervised linear regression

    David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022

  8. [8]

    Artificial replay: a meta-algorithm for harnessing historical data in bandits

    Siddhartha Banerjee, Sean R Sinclair, Milind Tambe, Lily Xu, and Christina Lee Yu. Artificial replay: a meta-algorithm for harnessing historical data in bandits. arXiv preprint arXiv:2210.00025, 2022

  9. [9]

    Optimal adaptive policies for se- quential allocation problems

    Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for se- quential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996

  10. [10]

    Measurement error models with auxiliary data

    Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies , 72(2):343–366, 2005

  11. [11]

    Semiparametric efficiency in gmm models with auxiliary data

    Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in gmm models with auxiliary data. The Annals of Statistics , 36(2):808, 2008. 14

  12. [12]

    A unified approach to regression analysis under double- sampling designs

    Yi-Hau Chen and Hung Chen. A unified approach to regression analysis under double- sampling designs. Journal of the Royal Statistical Society Series B: Statistical Method- ology, 62(3):449–460, 2000

  13. [13]

    Leveraging (biased) information: Multi-armed bandits with offline data

    Wang Chi Cheung and Lixing Lyu. Leveraging (biased) information: Multi-armed bandits with offline data. arXiv preprint arXiv:2405.02594 , 2024

  14. [14]

    How the new york times is experimenting with recommendation algo- rithms

    Anna Coenen. How the new york times is experimenting with recommendation algo- rithms. NYT Open, 2019. [Last accessed Feb 3, 2025]

  15. [15]

    Normal bandits of unknown means and variances

    Wesley Cowan, Junya Honda, and Michael N Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research , 18(154):1–28, 2018

  16. [16]

    Prediction de-correlated inference: A safe approach for post-prediction inference

    Feng Gan, Wanfeng Liang, and Changliang Zou. Prediction de-correlated inference: A safe approach for post-prediction inference. Australian & New Zealand Journal of Statistics, 66(4):417–440, 2024

  17. [17]

    Some monotonicity theorems for χ2, f and t distributions with applications

    Bhaskar K Ghosh. Some monotonicity theorems for χ2, f and t distributions with applications. Journal of the Royal Statistical Society: Series B (Methodological) , 35(3):480–492, 1973

  18. [18]

    An introduction to the augmented inverse propensity weighted estimator

    Adam N Glynn and Kevin M Quinn. An introduction to the augmented inverse propensity weighted estimator. Political analysis, 18(1):36–56, 2010

  19. [19]

    Another look at inference after prediction

    Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R McCaw, and David Cheng. Another look at inference after prediction. arXiv preprint arXiv:2411.19908 , 2024

  20. [20]

    Predictions as surrogates: Revisiting surrogate outcomes in the age of ai.arXiv preprint arXiv:2501.09731,

    Wenlong Ji, Lihua Lei, and Tijana Zrnic. Predictions as surrogates: Revisiting sur- rogate outcomes in the age of ai. arXiv preprint arXiv:2501.09731 , 2025

  21. [21]

    Asymptotically efficient adaptive allocation rules

    Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics , 6(1):4–22, 1985

  22. [22]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 1st edition, 2020

  23. [23]

    A perspective on the use of control variables to increase the efficiency of monte carlo simulations

    Stephen S Lavenberg and Peter D Welch. A perspective on the use of control variables to increase the efficiency of monte carlo simulations. Management Science, 27(3):322– 335, 1981

  24. [24]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th In- ternational Conference on World Wide Web , 2010

  25. [25]

    Control variate remedies

    Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990

  26. [26]

    Inference using surrogate outcome data and a validation sample

    Margaret Sullivan Pepe. Inference using surrogate outcome data and a validation sample. Biometrika, 79(2):355–365, 1992

  27. [27]

    Estimation of regression coefficients when some regressors are not always observed

    James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994

  28. [28]

    Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen

    Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning , 11(1):1–96, 2018. 15

  29. [29]

    Adjusting for nonig- norable drop-out using semiparametric nonresponse models

    Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonig- norable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999

  30. [30]

    Multi-armed bandit problems with history

    Pannagadatta Shivaswamy and Thorsten Joachims. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics , pages 1046–1054. PMLR, 2012

  31. [31]

    Long-term value of exploration: Mea- surements, findings and algorithms

    Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, Ed H Chi, Cristos Goodrow, Su-Lin Wu, Lexi Baugher, and Minmin Chen. Long-term value of exploration: Mea- surements, findings and algorithms. Proceedings of the 17th ACM International Con- ference on Web Search and Data Mining , 2024

  32. [32]

    Stochastic multi-armed bandits with con- trol variates

    Arun Verma and Manjesh Kumar Hanawal. Stochastic multi-armed bandits with con- trol variates. Advances in Neural Information Processing Systems , 34:27592–27603, 2021

  33. [33]

    Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges

    Sof´ ıa S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics , 30(2):199, 2015

  34. [34]

    Semi-supervised inference: General theory and estimation of means

    Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics , 47(5):2538–2566, 2019

  35. [35]

    Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

    Chicheng Zhang, Alekh Agarwal, Hal Daum´ e III, John Langford, and Sahand N Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301 , 2019

  36. [36]

    Active Statistical Inference

    Tijana Zrnic and Emmanuel J Cand` es. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024. A Proof This section provides proof of the theoretical results presented in this paper. First, we prove the properties of the machine learning-assisted mean estimator presented in Propo- sition 1 - 3. Proof of Proposition 1. Recall that the definitio...

  37. [37]

    Theorem 2

    We will start with the following generalized version of regret bound. Theorem 2. Under the Gaussian data generation model (2), for any ϵ ∈ (0, 1) and T ≥ 4K, if the sample size of offline predictions satisfies Nk ≥ 1 δk   2 logT log 1 + ∆2 k 24σ2 k (1−ϵ)2 1+ϵ + 3   , ∀k ∈ [K] (25) then the expected regret of Algorithm 1 can be bounded by: E[RegT] ≤ X ...