Asymptotic Theory and Sequential Testing for Adaptive Bandits

Dandan Jiang; Li Yang; Xiaodong Yan

Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and represent a central topic in statistical decision theory. Yet, conducting valid sequential testing under adaptive allocation remains challenging due to the lack of asymptotic theory under non-i.i.d. reward sequences and sublinear sample sizes for some arms. To address this open challenge, we propose an Urn Bandit (UNB) process to integrate the reinforcement mechanism of urn probabilistic models with MAB principles, ensuring almost sure concentration of allocation proportions on optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d. reward sequences with non-sub-Gaussian tails and pairwise cross-arm dependence. To overcome the limitations of existing methods that focus mainly on cumulative regret and therefore provide only algorithmic performance guarantees without supporting valid sequential testing, we develop an asymptotic theory for sequential test statistics under the proposed UNB process. The resulting framework enables a broad class of sequential inference procedures, such as A/B testing and policy evaluation. Simulation studies and real data analysis demonstrate that UNB maintains testing performance comparable to that of the equal randomization (ER) design while achieving improved reward accumulation relative to ER.
