Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher , Vishrant Tripathi , Mung Chiang , Christopher G. Brinton

Authors on Pith no claims yet

classification 💻 cs.LG stat.ML

keywords kernelswhittlenon-stationaryonlinetransitionwhileaddressalgorithm

read the original abstract

The restless multi-armed bandit (RMAB) framework is a popular approach to solving resource allocation problems in networked systems. In this paper, we study optimal resource allocation in RMABs facing unknown and non-stationary dynamics. Solving RMABs optimally is known to be PSPACE-hard even with full knowledge of model parameters. While Whittle index policies offer asymptotic optimality with low computational cost, they require access to stationary transition kernels, an unrealistic assumption in many modern networking applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Through theoretical analysis, we show that our algorithm achieves sub-linear dynamic regret with respect to the number of episodes. We further address the important case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. In our scheme, window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper-confidence-bound of the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Modularized Framework for Piecewise-Stationary Restless Bandits
cs.IT 2026-04 unverdicted novelty 7.0

A modular PS-RMAB framework integrates base algorithms with change detection and diminishing exploration to achieve tilde-O(sqrt(L M K T)) excess regret against a segment-oracle benchmark, with simulations showing nea...