Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
The multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce because it must be gathered during the online phase, while the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available before any arm is deployed. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to rely heavily on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward is completely misaligned with the true mean rewards, and achieves asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.
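The claim that a biased surrogate can still help is easiest to see with a standard control-variate mean estimator. The sketch below is not the paper's MLA-UCB algorithm; it is a textbook illustration under assumed conditions (one arm, Gaussian noise, a surrogate whose mean is off by a constant), and every name in it (`control_variate_mean`, `simulate`, `surrogate_bias`) is hypothetical. It only shows why a correlated surrogate with a well-estimated offline mean can shrink the error of a mean-reward estimate, which is the kind of quantity a UCB index is built on.

```python
import random
import statistics


def control_variate_mean(rewards, surrogates, offline_surrogates):
    """Estimate one arm's true mean reward using a surrogate as a control variate.

    rewards, surrogates: paired online samples (true reward, ML prediction).
    offline_surrogates: ML predictions computed from auxiliary data only.
    """
    y_bar = statistics.fmean(rewards)
    s_bar = statistics.fmean(surrogates)
    s_offline_bar = statistics.fmean(offline_surrogates)
    # Plug-in coefficient beta = Cov(Y, S) / Var(S), estimated from online pairs,
    # so no prior knowledge of the covariance is assumed.
    var_s = statistics.variance(surrogates)
    cov_ys = sum((y - y_bar) * (s - s_bar)
                 for y, s in zip(rewards, surrogates)) / (len(rewards) - 1)
    beta = cov_ys / var_s if var_s > 0 else 0.0
    # Only the gap between online and offline surrogate means enters, so a
    # constant bias in the surrogate cancels out.
    return y_bar - beta * (s_bar - s_offline_bar)


def simulate(n_online=30, n_offline=1000, rho=0.9, reps=200, seed=0):
    """Return (MSE of the plain sample mean, MSE of the control-variate estimate)."""
    rng = random.Random(seed)
    true_mean, surrogate_bias = 1.0, 4.0  # assumed values; surrogate mean is far off
    sq_err_plain, sq_err_cv = [], []
    for _ in range(reps):
        offline = [true_mean + surrogate_bias + rng.gauss(0, 1)
                   for _ in range(n_offline)]
        rewards, surrogates = [], []
        for _ in range(n_online):
            z, e = rng.gauss(0, 1), rng.gauss(0, 1)
            # Reward and surrogate both have unit variance and correlation rho.
            rewards.append(true_mean + rho * z + (1 - rho ** 2) ** 0.5 * e)
            surrogates.append(true_mean + surrogate_bias + z)
        plain = statistics.fmean(rewards)
        cv = control_variate_mean(rewards, surrogates, offline)
        sq_err_plain.append((plain - true_mean) ** 2)
        sq_err_cv.append((cv - true_mean) ** 2)
    return statistics.fmean(sq_err_plain), statistics.fmean(sq_err_cv)


if __name__ == "__main__":
    mse_plain, mse_cv = simulate()
    print(f"plain mean MSE: {mse_plain:.4f}, control-variate MSE: {mse_cv:.4f}")
```

Under these assumptions the control-variate error should come out well below the plain sample mean's error, even though the surrogate's mean is shifted by 4. A bandit algorithm would additionally need a confidence width that accounts for the estimated coefficient, which this sketch does not attempt.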
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: method (1)
polarities: background (1)
representative citing papers
Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
citing papers explorer
- Online Pandora's Box for Contextual LLM Cascading
  Introduces a parametric reservation-index policy with GMM estimation and UCB exploration for contextual LLM cascading under output-mediated feedback, claiming dimension-dependent square-root regret.
- Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
  Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.