Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Lihua Lei; Ruihao Zhu; Wenlong Ji; Yihan Pan

arxiv: 2506.16658 · v2 · submitted 2025-06-20 · 🧮 math.ST · cs.LG· stat.ML· stat.TH

Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Wenlong Ji , Yihan Pan , Ruihao Zhu , Lihua Lei This is my paper

Pith reviewed 2026-05-19 08:45 UTC · model grok-4.3

classification 🧮 math.ST cs.LGstat.MLstat.TH

keywords multi-armed banditssurrogate rewardsmachine learningupper confidence boundregret analysisGaussian distributionasymptotic optimalityauxiliary data

0 comments

The pith

When true and predicted rewards are jointly Gaussian, MLA-UCB improves cumulative regret in multi-armed bandits even if surrogate means misalign with true means.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional bandit algorithms suffer from scarce online data, but auxiliary information can be turned into surrogate rewards via pre-trained machine learning models. The paper proposes MLA-UCB, which integrates these surrogates into the decision process for any prediction model. Under joint Gaussianity of predicted and true rewards, it reduces cumulative regret without requiring knowledge of their covariance and remains effective even when surrogate means are completely misaligned. The method also handles batched observations and derives guarantees for non-Gaussian cases, with empirical support from simulations and real-world tasks like language model selection.

Core claim

The central claim is that MLA-UCB provably improves the cumulative regret of multi-armed bandits when predicted and true rewards are jointly Gaussian, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and it achieves asymptotic optimality among a broad class of policies without prior knowledge of the covariance matrix.

What carries the argument

The MLA-UCB algorithm that constructs confidence bounds leveraging machine learning-generated surrogate rewards from side information and historical data.

If this is right

Regret is reduced compared to classical UCB even with misaligned surrogates.
The algorithm achieves asymptotic optimality in a broad class of policies.
It applies to any reward prediction model and form of auxiliary data.
It extends to batched reward settings with computable bounds for non-Gaussian rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be adapted to contextual bandits or reinforcement learning where auxiliary data is abundant.
Relaxing the joint Gaussian assumption using other concentration inequalities might broaden applicability to heavy-tailed rewards.
Testing with larger offline datasets could reveal how surrogate quality scales with correlation strength.

Load-bearing premise

The predicted and true rewards are jointly Gaussian.

What would settle it

A simulation or theoretical counterexample where rewards are not jointly Gaussian, the surrogates are misaligned, and MLA-UCB shows higher regret than standard UCB.

Figures

Figures reproduced from arXiv: 2506.16658 by Lihua Lei, Ruihao Zhu, Wenlong Ji, Yihan Pan.

**Figure 1.** Figure 1: Comparison of the quantile bound q d(s 2 d−1 − 1) and the true quantile qd 1 2s √ log s in Proposition 4. We set d = ⌊log s⌋ and vary the values of s to reflect the order of regret bound (1). Theorem 1. Under the Gaussian reward model (2), for any T ≥ 4K, if offline sample size satisfies Nk ≥ 1 δk   2 log T log 1 + ∆2 k 24σ 2 k + 3   , ∀k ∈ [K], (9) then the expected regret of Algorithm 1 can b… view at source ↗

**Figure 2.** Figure 2: Cumulative regret of the MLA-UCB algorithm (Algorithm 1) and the classical UCB algorithm) under a Gaussian model. For Figure 2a to 2c, we report the final cumulative regret at T = 1000 steps. For Figure 2d, we choose ρ 2 = 0.5, ∆ = 0.5, N = 100. where x = (x1, x2) ⊤ is the feature vector, wk = (wk,1, wk,2) is the arm-specific weight parameter, ϵ is the random noise. The features are assumed to be only visi… view at source ↗

**Figure 3.** Figure 3: The conditional expectation and distribution of the true rewards. The true reward is generated using wk,1 = wk,2 = 1 and σ = 0.1. The reward function is highly nonlinear and the reward distribution is different from the Gaussian distribution. From [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Cumulative regret of the MLA-UCB algorithm and the classical UCB algorithm under different settings. The true correlations of predictions of each setting are shown in [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and achieves the asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MLA-UCB claims regret gains from ML surrogates without covariance knowledge under joint Gaussianity, but the bound construction needs close checking to confirm it beats classical UCB when means misalign.

read the letter

The main point is that this paper sets up a bandit problem where you feed in surrogate rewards produced by any pre-trained ML model on auxiliary data, and MLA-UCB delivers lower cumulative regret than plain UCB when the true and surrogate rewards are jointly Gaussian, even if the surrogate means are completely misaligned with the truth. It does this without requiring knowledge of the covariance matrix and extends the approach to batched pulls with non-Gaussian rewards plus computable bounds. The asymptotic optimality result among a broad policy class is the cleanest theoretical piece. The simulations and the two real-world examples on language-model selection and video recommendation show consistent regret drops with moderate offline sample sizes, which is useful to see. The work is new in making the surrogate-reward source a first-class input rather than just another feature, and the algorithm stays model-agnostic. That combination is absent from the cited prior work. The joint-Gaussianity assumption is the clearest limitation for the strongest claims, and the stress-test note correctly flags that any improvement must come through the correlation term. If the analysis truly avoids estimating or using that term, it is not obvious how the confidence intervals end up strictly tighter than the marginal-variance version that recovers classical UCB. The proofs will need to be read carefully on that point. This is aimed at people working on practical bandit deployments who already have side data or historical covariates they can run through an ML model. A reader in online learning or recommendation systems would get concrete value from the algorithm and the experiments. The paper is grounded enough and addresses a real gap to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MLA-UCB, a variant of UCB for multi-armed bandits that incorporates surrogate rewards produced by pre-trained ML models applied to auxiliary data. Under the assumption that predicted and true rewards are jointly Gaussian, the paper claims that MLA-UCB yields strictly lower cumulative regret than classical UCB even when surrogate means are completely misaligned with true means, requires no knowledge of the covariance matrix, and attains asymptotic optimality within a broad class of policies. The work extends the approach to batched settings with non-Gaussian rewards, supplies computable confidence bounds, and reports regret reductions in simulations and real-world experiments on language-model selection and video recommendation.

Significance. If the central theoretical claims hold, the result would be significant for sequential decision-making: it shows how offline ML predictions can be leveraged to improve online regret without requiring alignment of means or knowledge of the correlation parameter. The extension to batched non-Gaussian rewards and the empirical validation on practical tasks further strengthen applicability. The manuscript supplies explicit regret bounds and extensive simulations, which are positive features.

major comments (2)

[Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and demonstrate that they remain strictly sub-
[§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.

minor comments (2)

[§2] Notation for the surrogate reward process and the definition of 'complete mean misalignment' should be introduced earlier and used consistently throughout the theoretical sections.
[§5] Figure captions for the simulation results could more explicitly state the correlation values and offline sample sizes used, to allow direct comparison with the theoretical regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The comments highlight important points about the clarity of our theoretical claims and the scope of the batched extension. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and §3 (theoretical analysis)] Abstract (theoretical results paragraph) and the main regret analysis: the claim that MLA-UCB produces strictly better cumulative regret than classical UCB under joint Gaussianity, even with complete mean misalignment and without any knowledge of the covariance matrix, is load-bearing. Joint Gaussianity implies the true reward equals a linear function of the surrogate plus independent noise; any tightening of the confidence interval must therefore exploit the correlation coefficient. If the algorithm and its analysis are truly covariance-free (as stated), a conservative marginal-variance bound recovers exactly the classical UCB width and eliminates the asserted regret improvement. Please supply the explicit construction of the covariance-free confidence bounds (including the precise form of the width and how extrapolation bias is controlled) and

Authors: We agree that the mechanism by which improvement occurs under joint Gaussianity deserves a more explicit derivation. The covariance-free property means that neither the algorithm nor the stated regret bound requires the user to know or input the covariance value; the analysis instead uses the joint Gaussian structure to bound the conditional variance and the extrapolation bias term without needing its numerical value. The explicit form of the confidence width appears in Equation (8) and the proof of Theorem 3.1, where the width is constructed from the marginal variance of the surrogate plus a controlled bias term that vanishes asymptotically. Because the surrogate is observed at every round, the effective information gain remains strictly larger than the marginal case even under a conservative bound on the correlation. We will add a new subsection in the revision that walks through the construction step by step, explicitly compares the resulting width to the classical UCB width, and shows that the regret improvement is preserved under mean misalignment. revision: yes
Referee: [§4] §4 (batched extension): the derivation of computable confidence bounds and regret guarantees for non-Gaussian batched rewards is presented as an improvement over classical UCB, yet the handling of the extrapolation bias induced by ML-generated surrogates is not shown to preserve the improvement when the batch size is moderate and the surrogate model is misspecified. A concrete counter-example or additional assumption would clarify whether the improvement survives outside the Gaussian case.

Authors: The referee correctly notes that the non-Gaussian batched analysis relies on a bounded-bias assumption to control extrapolation error. Under this assumption the regret bound remains strictly better than classical UCB for any fixed batch size, because the surrogate still supplies additional observations per batch. When the surrogate is misspecified, the improvement is preserved as long as the bias remains bounded by a constant that does not grow with the horizon; our simulations in Section 5 already illustrate this regime. To make the boundary conditions transparent, we will insert a remark that states the precise bias condition and supplies a simple counter-example (a deterministic misspecification that makes the bias grow linearly with batch size) under which the improvement can disappear. This will be accompanied by an additional simulation panel showing the transition point. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives MLA-UCB regret bounds and asymptotic optimality from the joint Gaussianity assumption on predicted and true rewards, explicitly without requiring covariance knowledge, and claims improvement even under mean misalignment. No load-bearing step reduces by construction to a fitted parameter, self-citation, or input renaming; the central guarantee is presented as following from the distributional assumption and algorithm construction rather than tautologically from data fits or prior author results. The analysis is self-contained against the stated assumptions and does not exhibit self-definitional, fitted-input, or ansatz-smuggling patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the joint Gaussian assumption for true and surrogate rewards and on the existence of a pre-trained ML model that produces surrogates from auxiliary data; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption Predicted and true rewards are jointly Gaussian
Invoked to obtain closed-form confidence bounds and regret guarantees without requiring the covariance matrix (abstract, theoretical results paragraph).

pith-pipeline@v0.9.0 · 5858 in / 1424 out tokens · 32360 ms · 2026-05-19T08:45:33.726044+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
cs.LG 2026-05 unverdicted novelty 6.0

Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory , volume 23, pages 39.1–39.26. PMLR, 2012

work page 2012
[2]

Prediction-powered inference

Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023

work page 2023
[3]

PPI++: Efficient Prediction-Powered Inference

Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453 , 2023. 13 0 200 400 600 800 1000 Time Step T 0 10 20 30 40Cumulative Regret MLA-UCB algorithm with ML predictions, = 0 UCB Linear Model SVR Neural Network Decision Tree 0 200 400 600 800 1000 Time Step T 0 10 20 30 40C...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely

Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, Inc, 2019

work page 2019
[5]

Finite-time analysis of the mul- tiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the mul- tiarmed bandit problem. Machine Learning, 47:235–256, 2002

work page 2002
[6]

On multi-armed bandit designs for dose-finding trials

Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding trials. Journal of Machine Learning Research , 22(14):1–38, 2021

work page 2021
[7]

Semi-supervised linear regression

David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022

work page 2022
[8]

Artificial replay: a meta-algorithm for harnessing historical data in bandits

Siddhartha Banerjee, Sean R Sinclair, Milind Tambe, Lily Xu, and Christina Lee Yu. Artificial replay: a meta-algorithm for harnessing historical data in bandits. arXiv preprint arXiv:2210.00025, 2022

work page arXiv 2022
[9]

Optimal adaptive policies for se- quential allocation problems

Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for se- quential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996

work page 1996
[10]

Measurement error models with auxiliary data

Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies , 72(2):343–366, 2005

work page 2005
[11]

Semiparametric efficiency in gmm models with auxiliary data

Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in gmm models with auxiliary data. The Annals of Statistics , 36(2):808, 2008. 14

work page 2008
[12]

A unified approach to regression analysis under double- sampling designs

Yi-Hau Chen and Hung Chen. A unified approach to regression analysis under double- sampling designs. Journal of the Royal Statistical Society Series B: Statistical Method- ology, 62(3):449–460, 2000

work page 2000
[13]

Leveraging (biased) information: Multi-armed bandits with offline data

Wang Chi Cheung and Lixing Lyu. Leveraging (biased) information: Multi-armed bandits with offline data. arXiv preprint arXiv:2405.02594 , 2024

work page arXiv 2024
[14]

How the new york times is experimenting with recommendation algo- rithms

Anna Coenen. How the new york times is experimenting with recommendation algo- rithms. NYT Open, 2019. [Last accessed Feb 3, 2025]

work page 2019
[15]

Normal bandits of unknown means and variances

Wesley Cowan, Junya Honda, and Michael N Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research , 18(154):1–28, 2018

work page 2018
[16]

Prediction de-correlated inference: A safe approach for post-prediction inference

Feng Gan, Wanfeng Liang, and Changliang Zou. Prediction de-correlated inference: A safe approach for post-prediction inference. Australian & New Zealand Journal of Statistics, 66(4):417–440, 2024

work page 2024
[17]

Some monotonicity theorems for χ2, f and t distributions with applications

Bhaskar K Ghosh. Some monotonicity theorems for χ2, f and t distributions with applications. Journal of the Royal Statistical Society: Series B (Methodological) , 35(3):480–492, 1973

work page 1973
[18]

An introduction to the augmented inverse propensity weighted estimator

Adam N Glynn and Kevin M Quinn. An introduction to the augmented inverse propensity weighted estimator. Political analysis, 18(1):36–56, 2010

work page 2010
[19]

Another look at inference after prediction

Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R McCaw, and David Cheng. Another look at inference after prediction. arXiv preprint arXiv:2411.19908 , 2024

work page arXiv 2024
[20]

Predictions as surrogates: Revisiting surrogate outcomes in the age of ai.arXiv preprint arXiv:2501.09731,

Wenlong Ji, Lihua Lei, and Tijana Zrnic. Predictions as surrogates: Revisiting sur- rogate outcomes in the age of ai. arXiv preprint arXiv:2501.09731 , 2025

work page arXiv 2025
[21]

Asymptotically efficient adaptive allocation rules

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics , 6(1):4–22, 1985

work page 1985
[22]

Bandit Algorithms

Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 1st edition, 2020

work page 2020
[23]

A perspective on the use of control variables to increase the efficiency of monte carlo simulations

Stephen S Lavenberg and Peter D Welch. A perspective on the use of control variables to increase the efficiency of monte carlo simulations. Management Science, 27(3):322– 335, 1981

work page 1981
[24]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th In- ternational Conference on World Wide Web , 2010

work page 2010
[25]

Control variate remedies

Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990

work page 1990
[26]

Inference using surrogate outcome data and a validation sample

Margaret Sullivan Pepe. Inference using surrogate outcome data and a validation sample. Biometrika, 79(2):355–365, 1992

work page 1992
[27]

Estimation of regression coefficients when some regressors are not always observed

James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994

work page 1994
[28]

Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen

Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning , 11(1):1–96, 2018. 15

work page 2018
[29]

Adjusting for nonig- norable drop-out using semiparametric nonresponse models

Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonig- norable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999

work page 1999
[30]

Multi-armed bandit problems with history

Pannagadatta Shivaswamy and Thorsten Joachims. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics , pages 1046–1054. PMLR, 2012

work page 2012
[31]

Long-term value of exploration: Mea- surements, findings and algorithms

Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, Ed H Chi, Cristos Goodrow, Su-Lin Wu, Lexi Baugher, and Minmin Chen. Long-term value of exploration: Mea- surements, findings and algorithms. Proceedings of the 17th ACM International Con- ference on Web Search and Data Mining , 2024

work page 2024
[32]

Stochastic multi-armed bandits with con- trol variates

Arun Verma and Manjesh Kumar Hanawal. Stochastic multi-armed bandits with con- trol variates. Advances in Neural Information Processing Systems , 34:27592–27603, 2021

work page 2021
[33]

Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges

Sof´ ıa S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics , 30(2):199, 2015

work page 2015
[34]

Semi-supervised inference: General theory and estimation of means

Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics , 47(5):2538–2566, 2019

work page 2019
[35]

Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

Chicheng Zhang, Alekh Agarwal, Hal Daum´ e III, John Langford, and Sahand N Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[36]

Active Statistical Inference

Tijana Zrnic and Emmanuel J Cand` es. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024. A Proof This section provides proof of the theoretical results presented in this paper. First, we prove the properties of the machine learning-assisted mean estimator presented in Propo- sition 1 - 3. Proof of Proposition 1. Recall that the definitio...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Theorem 2

We will start with the following generalized version of regret bound. Theorem 2. Under the Gaussian data generation model (2), for any ϵ ∈ (0, 1) and T ≥ 4K, if the sample size of offline predictions satisfies Nk ≥ 1 δk   2 logT log 1 + ∆2 k 24σ2 k (1−ϵ)2 1+ϵ + 3   , ∀k ∈ [K] (25) then the expected regret of Algorithm 1 can be bounded by: E[RegT] ≤ X ...

work page

[1] [1]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory , volume 23, pages 39.1–39.26. PMLR, 2012

work page 2012

[2] [2]

Prediction-powered inference

Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023

work page 2023

[3] [3]

PPI++: Efficient Prediction-Powered Inference

Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453 , 2023. 13 0 200 400 600 800 1000 Time Step T 0 10 20 30 40Cumulative Regret MLA-UCB algorithm with ML predictions, = 0 UCB Linear Model SVR Neural Network Decision Tree 0 200 400 600 800 1000 Time Step T 0 10 20 30 40C...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely

Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, Inc, 2019

work page 2019

[5] [5]

Finite-time analysis of the mul- tiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the mul- tiarmed bandit problem. Machine Learning, 47:235–256, 2002

work page 2002

[6] [6]

On multi-armed bandit designs for dose-finding trials

Maryam Aziz, Emilie Kaufmann, and Marie-Karelle Riviere. On multi-armed bandit designs for dose-finding trials. Journal of Machine Learning Research , 22(14):1–38, 2021

work page 2021

[7] [7]

Semi-supervised linear regression

David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022

work page 2022

[8] [8]

Artificial replay: a meta-algorithm for harnessing historical data in bandits

Siddhartha Banerjee, Sean R Sinclair, Milind Tambe, Lily Xu, and Christina Lee Yu. Artificial replay: a meta-algorithm for harnessing historical data in bandits. arXiv preprint arXiv:2210.00025, 2022

work page arXiv 2022

[9] [9]

Optimal adaptive policies for se- quential allocation problems

Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for se- quential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996

work page 1996

[10] [10]

Measurement error models with auxiliary data

Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies , 72(2):343–366, 2005

work page 2005

[11] [11]

Semiparametric efficiency in gmm models with auxiliary data

Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in gmm models with auxiliary data. The Annals of Statistics , 36(2):808, 2008. 14

work page 2008

[12] [12]

A unified approach to regression analysis under double- sampling designs

Yi-Hau Chen and Hung Chen. A unified approach to regression analysis under double- sampling designs. Journal of the Royal Statistical Society Series B: Statistical Method- ology, 62(3):449–460, 2000

work page 2000

[13] [13]

Leveraging (biased) information: Multi-armed bandits with offline data

Wang Chi Cheung and Lixing Lyu. Leveraging (biased) information: Multi-armed bandits with offline data. arXiv preprint arXiv:2405.02594 , 2024

work page arXiv 2024

[14] [14]

How the new york times is experimenting with recommendation algo- rithms

Anna Coenen. How the new york times is experimenting with recommendation algo- rithms. NYT Open, 2019. [Last accessed Feb 3, 2025]

work page 2019

[15] [15]

Normal bandits of unknown means and variances

Wesley Cowan, Junya Honda, and Michael N Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research , 18(154):1–28, 2018

work page 2018

[16] [16]

Prediction de-correlated inference: A safe approach for post-prediction inference

Feng Gan, Wanfeng Liang, and Changliang Zou. Prediction de-correlated inference: A safe approach for post-prediction inference. Australian & New Zealand Journal of Statistics, 66(4):417–440, 2024

work page 2024

[17] [17]

Some monotonicity theorems for χ2, f and t distributions with applications

Bhaskar K Ghosh. Some monotonicity theorems for χ2, f and t distributions with applications. Journal of the Royal Statistical Society: Series B (Methodological) , 35(3):480–492, 1973

work page 1973

[18] [18]

An introduction to the augmented inverse propensity weighted estimator

Adam N Glynn and Kevin M Quinn. An introduction to the augmented inverse propensity weighted estimator. Political analysis, 18(1):36–56, 2010

work page 2010

[19] [19]

Another look at inference after prediction

Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R McCaw, and David Cheng. Another look at inference after prediction. arXiv preprint arXiv:2411.19908 , 2024

work page arXiv 2024

[20] [20]

Predictions as surrogates: Revisiting surrogate outcomes in the age of ai.arXiv preprint arXiv:2501.09731,

Wenlong Ji, Lihua Lei, and Tijana Zrnic. Predictions as surrogates: Revisiting sur- rogate outcomes in the age of ai. arXiv preprint arXiv:2501.09731 , 2025

work page arXiv 2025

[21] [21]

Asymptotically efficient adaptive allocation rules

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics , 6(1):4–22, 1985

work page 1985

[22] [22]

Bandit Algorithms

Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 1st edition, 2020

work page 2020

[23] [23]

A perspective on the use of control variables to increase the efficiency of monte carlo simulations

Stephen S Lavenberg and Peter D Welch. A perspective on the use of control variables to increase the efficiency of monte carlo simulations. Management Science, 27(3):322– 335, 1981

work page 1981

[24] [24]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th In- ternational Conference on World Wide Web , 2010

work page 2010

[25] [25]

Control variate remedies

Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990

work page 1990

[26] [26]

Inference using surrogate outcome data and a validation sample

Margaret Sullivan Pepe. Inference using surrogate outcome data and a validation sample. Biometrika, 79(2):355–365, 1992

work page 1992

[27] [27]

Estimation of regression coefficients when some regressors are not always observed

James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994

work page 1994

[28] [28]

Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen

Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning , 11(1):1–96, 2018. 15

work page 2018

[29] [29]

Adjusting for nonig- norable drop-out using semiparametric nonresponse models

Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonig- norable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999

work page 1999

[30] [30]

Multi-armed bandit problems with history

Pannagadatta Shivaswamy and Thorsten Joachims. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics , pages 1046–1054. PMLR, 2012

work page 2012

[31] [31]

Long-term value of exploration: Mea- surements, findings and algorithms

Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, Ed H Chi, Cristos Goodrow, Su-Lin Wu, Lexi Baugher, and Minmin Chen. Long-term value of exploration: Mea- surements, findings and algorithms. Proceedings of the 17th ACM International Con- ference on Web Search and Data Mining , 2024

work page 2024

[32] [32]

Stochastic multi-armed bandits with con- trol variates

Arun Verma and Manjesh Kumar Hanawal. Stochastic multi-armed bandits with con- trol variates. Advances in Neural Information Processing Systems , 34:27592–27603, 2021

work page 2021

[33] [33]

Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges

Sof´ ıa S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics , 30(2):199, 2015

work page 2015

[34] [34]

Semi-supervised inference: General theory and estimation of means

Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics , 47(5):2538–2566, 2019

work page 2019

[35] [35]

Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

Chicheng Zhang, Alekh Agarwal, Hal Daum´ e III, John Langford, and Sahand N Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[36] [36]

Active Statistical Inference

Tijana Zrnic and Emmanuel J Cand` es. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024. A Proof This section provides proof of the theoretical results presented in this paper. First, we prove the properties of the machine learning-assisted mean estimator presented in Propo- sition 1 - 3. Proof of Proposition 1. Recall that the definitio...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Theorem 2

We will start with the following generalized version of regret bound. Theorem 2. Under the Gaussian data generation model (2), for any ϵ ∈ (0, 1) and T ≥ 4K, if the sample size of offline predictions satisfies Nk ≥ 1 δk   2 logT log 1 + ∆2 k 24σ2 k (1−ϵ)2 1+ϵ + 3   , ∀k ∈ [K] (25) then the expected regret of Algorithm 1 can be bounded by: E[RegT] ≤ X ...

work page