Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

· 2025 · cs.LG · arXiv 2506.18186

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

The restless multi-armed bandit (RMAB) framework is a popular approach to solving resource allocation problems in networked systems. In this paper, we study optimal resource allocation in RMABs facing unknown and non-stationary dynamics. Solving RMABs optimally is known to be PSPACE-hard even with full knowledge of model parameters. While Whittle index policies offer asymptotic optimality with low computational cost, they require access to stationary transition kernels, an unrealistic assumption in many modern networking applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Through theoretical analysis, we show that our algorithm achieves sub-linear dynamic regret with respect to the number of episodes. We further address the important case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. In our scheme, window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper-confidence-bound of the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.

representative citing papers

A Modularized Framework for Piecewise-Stationary Restless Bandits

cs.IT · 2026-04-11 · unverdicted · novelty 7.0

A modular PS-RMAB framework integrates base algorithms with change detection and diminishing exploration to achieve tilde-O(sqrt(L M K T)) excess regret against a segment-oracle benchmark, with simulations showing near-oracle performance.

MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

cs.LG · 2025-11-12 · unverdicted · novelty 7.0

Under the Markov-Averaged Indexability condition, synchronous Q-learning with Whittle indices converges almost surely to the optimal Q-function and indices in restless bandits whose arms are driven by an unobserved latent Markov environment.

citing papers explorer

Showing 2 of 2 citing papers.

A Modularized Framework for Piecewise-Stationary Restless Bandits cs.IT · 2026-04-11 · unverdicted · none · ref 2 · internal anchor
A modular PS-RMAB framework integrates base algorithms with change detection and diminishing exploration to achieve tilde-O(sqrt(L M K T)) excess regret against a segment-oracle benchmark, with simulations showing near-oracle performance.
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment cs.LG · 2025-11-12 · unverdicted · none · ref 26 · internal anchor
Under the Markov-Averaged Indexability condition, synchronous Q-learning with Whittle indices converges almost surely to the optimal Q-function and indices in restless bandits whose arms are driven by an unobserved latent Markov environment.

Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

fields

years

verdicts

representative citing papers

citing papers explorer