A generic conversion turns offline local search algorithms into online stochastic combinatorial bandit algorithms with O(log^3 T) approximate regret.
Cambridge University Press
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Establishes last-iterate convergence rates for on-policy Q-learning under minimal irreducibility assumptions, with sample complexity O(1/ξ²) matching off-policy up to exploration factors.
Delight-gated exploration spends actions only when expected improvement times surprisal exceeds a gate price, recovers Pandora's reservation rule, and shows weaker regret growth than Thompson sampling or epsilon-greedy across bandits and MDPs with transferable hyperparameters.
Develops COF algorithm for MAB-CS that intelligently checks cheap arm feasibility by pooling samples, with generalized instance-dependent lower bounds and matching upper bounds on cumulative cost and quality regret.
A randomized (1+ε)-approximation algorithm for ordered-norm load balancing uses O((n+d)(ε^{-2} + log log d) log(n+d)) linear-oracle calls via follow-the-regularized-leader prices and martingale progress analysis.
RIE-Greedy uses stochasticity from cross-validation regularization to induce Thompson Sampling-like exploration, claimed equivalent in the two-armed case and empirically competitive in large-scale settings.
citing papers explorer
-
Offline Local Search for Online Stochastic Bandits
A generic conversion turns offline local search algorithms into online stochastic combinatorial bandit algorithms with O(log^3 T) approximate regret.
-
A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
Establishes last-iterate convergence rates for on-policy Q-learning under minimal irreducibility assumptions, with sample complexity O(1/ξ²) matching off-policy up to exploration factors.
-
Delightful Exploration
Delight-gated exploration spends actions only when expected improvement times surprisal exceeds a gate price, recovers Pandora's reservation rule, and shows weaker regret growth than Thompson sampling or epsilon-greedy across bandits and MDPs with transferable hyperparameters.
-
Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy
Develops COF algorithm for MAB-CS that intelligently checks cheap arm feasibility by pooling samples, with generalized instance-dependent lower bounds and matching upper bounds on cumulative cost and quality regret.
-
An Efficient Algorithm for Minimizing Ordered Norms in Fractional Load Balancing
A randomized (1+ε)-approximation algorithm for ordered-norm load balancing uses O((n+d)(ε^{-2} + log log d) log(n+d)) linear-oracle calls via follow-the-regularized-leader prices and martingale progress analysis.
-
RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits
RIE-Greedy uses stochasticity from cross-validation regularization to induce Thompson Sampling-like exploration, claimed equivalent in the two-armed case and empirically competitive in large-scale settings.