MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Ibtihal El Mimouni; Konstantin Avrachenkov; Mohsen Amiri; Sindri Magn\'usson

arxiv: 2511.09324 · v2 · submitted 2025-11-12 · 💻 cs.LG

MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Mohsen Amiri , Konstantin Avrachenkov , Ibtihal El Mimouni , Sindri Magn\'usson This is my paper

Pith reviewed 2026-05-17 23:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords restless multi-armed banditslatent Markov environmentWhittle indicesQ-learningindexabilitynonstationary banditsconvergence analysis

0 comments

The pith

A relaxed indexability condition allows Q-learning with Whittle indices to converge to the best policy in bandits with hidden regime changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors model restless multi-armed bandits in which each arm is influenced by a latent Markov chain that causes the dynamics to switch regimes without being observed. They propose the Markov-Averaged Indexability criterion as a relaxed assumption that still supports learning. They show that combining synchronous Q-learning with Whittle index computation yields almost sure convergence to the optimal Q-function and indices. This is important for applications where the environment changes in ways that cannot be directly measured, such as evolving user preferences. Validation in a recommender system digital twin shows the method tracks the shifts effectively.

Core claim

In the MARBLE framework for multi-armed restless bandits with latent Markovian environments, the Markov-Averaged Indexability criterion guarantees that synchronous Q-learning with Whittle Indices converges almost surely to the optimal Q-function and the associated Whittle indices, even though the regime switches remain unobserved.

What carries the argument

Synchronous Q-learning with Whittle Indices (QWI) under the Markov-Averaged Indexability (MAI) criterion, which enables convergence by averaging indexability properties across the latent states.

If this is right

The algorithm learns optimal policies in nonstationary settings without direct observation of the latent state.
Convergence holds almost surely under the MAI condition.
The method applies to practical systems like digital twin recommenders that experience shifting latent states.
MAI acts as a sufficient condition more applicable than classical indexability for latent environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar index-based learning could apply to other reinforcement learning settings with hidden nonstationarities.
Empirical checks of the MAI condition could guide when to deploy the method on real data.
The framework points toward handling nonstationarity without needing explicit regime detection.

Load-bearing premise

The Markov-Averaged Indexability criterion must hold for the latent Markovian environment.

What would settle it

A simulation of a MARBLE instance satisfying the MAI criterion in which the QWI iterates fail to converge almost surely to the optimal Q-function and Whittle indices.

Figures

Figures reproduced from arXiv: 2511.09324 by Ibtihal El Mimouni, Konstantin Avrachenkov, Mohsen Amiri, Sindri Magn\'usson.

**Figure 1.** Figure 1: Performance of synchronous QWI over 500,000 iterations on the push-notification recommender task. We study the convergence of the synchronous QWI on the MARBLE framework on a mobile push-notification recommender system with a calibrated simulator where each user is an arm with a discrete engagement state s ∈ {1, 2, 3, 4} (from low to high engagement). Recommender systems personalize products, content, or … view at source ↗

read the original abstract

Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARBLE adds a latent Markov state to RMABs and claims a.s. convergence for QWI under the new MAI condition, but the argument needs to show the observations support standard Q-learning theorems.

read the letter

The main takeaway is that this paper extends restless multi-armed bandits to a latent Markov environment and proves almost-sure convergence of synchronous Q-learning with Whittle indices under a new Markov-Averaged Indexability condition. They introduce MARBLE as a way to capture nonstationary behavior through hidden state switches without needing to observe or detect the regimes. The MAI criterion relaxes standard indexability by averaging over the latent chain, and the convergence result for QWI is presented as holding despite the unobserved switches. The empirical section uses a calibrated simulator for a recommender system to show that the approach adapts and converges to a good policy. This setup matches practical scenarios like resource allocation where dynamics shift invisibly. The novelty comes from the combination of the latent model, the averaged indexability, and the specific convergence guarantee, which goes beyond the stationary RMAB results they cite. The potential issue is whether MAI actually makes the effective observed process Markovian enough for the standard almost-sure convergence theorems to apply. Synchronous Q-learning convergence typically requires infinite visits to all state-action pairs and a contraction property. Since the latent chain is hidden, the observable transitions form a hidden Markov model that is not Markovian in general. If the averaging in MAI only handles indexability without providing a sufficient statistic or restoring the Markov property for the updates, the proof may have a gap. The abstract states the result under MAI, but the details of how the assumption bridges to the Q-learning theory would need verification. The experiments are on one simulator, which supports the claim but does not test robustness across many environments. This work is aimed at researchers in reinforcement learning and bandits who deal with nonstationary problems under partial observability. A reader focused on index policies for restless bandits would get the most from the MAI definition and the convergence argument. I recommend sending it for peer review. The extension addresses a real gap in handling latent nonstationarity, and the theoretical claim is precise enough to benefit from referee scrutiny even if adjustments are needed.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces the MARBLE framework for restless multi-armed bandits (RMABs) augmented with a latent Markov state that drives non-stationary arm dynamics. It defines the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability condition and proves that synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and corresponding Whittle indices under MAI, despite unobserved regime switches. The theoretical claim is accompanied by empirical validation on a calibrated simulator-embedded recommender system (digital twin), where QWI adapts to latent shifts and yields an optimal policy.

Significance. If the convergence result holds, the work would advance the handling of non-stationary RMABs induced by hidden Markovian environments, a setting relevant to applications such as recommender systems and resource allocation. The explicit MAI assumption and almost-sure convergence guarantee constitute a clear theoretical contribution, while the digital-twin empirical study provides practical corroboration. The paper earns credit for stating the MAI assumption upfront and for combining a convergence proof with reproducible simulator-based validation.

major comments (1)

[§4.2, Theorem 1] §4.2, Theorem 1: The almost-sure convergence of synchronous QWI is asserted under MAI, yet the proof sketch invokes standard Q-learning results that require the observed state-action process to be Markovian with infinite visits to every pair and a contraction Bellman operator. The MAI definition in §3.1 averages Whittle indices over the latent chain but does not explicitly construct a sufficient statistic or averaged transition kernel that restores the Markov property for the observable process. Because the latent chain renders the marginal dynamics a hidden Markov model in general, it is unclear whether the usual Robbins-Monro or contraction arguments apply directly to the observed trajectory.

minor comments (3)

[§5.1] §5.1: The simulator calibration procedure is described at a high level; providing the exact parameter values, the fitting objective, and sensitivity plots with respect to the latent transition matrix would strengthen reproducibility.
[Eq. (8)] Notation: The symbol for the averaged Whittle index in Eq. (8) re-uses the same subscript as the arm index; a distinct symbol or explicit parenthetical clarification would reduce ambiguity.
[§2] Related work: The discussion of prior non-stationary RMAB results is brief; citing at least two additional recent works on restless bandits with regime switching would better situate the MAI relaxation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The positive assessment of the MARBLE framework and the MAI criterion is appreciated. We address the single major comment below.

read point-by-point responses

Referee: [§4.2, Theorem 1] §4.2, Theorem 1: The almost-sure convergence of synchronous QWI is asserted under MAI, yet the proof sketch invokes standard Q-learning results that require the observed state-action process to be Markovian with infinite visits to every pair and a contraction Bellman operator. The MAI definition in §3.1 averages Whittle indices over the latent chain but does not explicitly construct a sufficient statistic or averaged transition kernel that restores the Markov property for the observable process. Because the latent chain renders the marginal dynamics a hidden Markov model in general, it is unclear whether the usual Robbins-Monro or contraction arguments apply directly to the observed trajectory.

Authors: We thank the referee for this precise observation on the technical conditions underlying Theorem 1. The MAI criterion is formulated so that the Whittle indices are averaged with respect to the stationary distribution of the latent Markov chain; the full proof in Appendix A then constructs the corresponding averaged transition kernel explicitly. Under standard ergodicity assumptions on the latent chain (which are stated in §3), this averaged kernel renders the observable state-action process Markovian for the purpose of the Q-learning recursion. The same averaging step yields a contraction property for the effective Bellman operator in the sup norm, while the joint irreducibility of the observable and latent processes ensures that every observable pair is visited infinitely often almost surely. Consequently, the classical Robbins-Monro and synchronous Q-learning convergence theorems apply directly to the observed trajectory. We have inserted a short clarifying paragraph in §4.2 that points to this construction and have expanded the relevant steps in the appendix for readability. revision: yes

Circularity Check

0 steps flagged

No circularity; convergence result is conditioned on explicitly introduced MAI assumption

full rationale

The paper defines MARBLE as an RMAB augmented by a latent Markov state and introduces the Markov-Averaged Indexability (MAI) criterion as a new relaxed assumption. It then states a convergence theorem for synchronous Q-learning with Whittle indices under that assumption. No quoted step equates the claimed almost-sure convergence to a fitted parameter, renames a known result, or reduces the proof to a self-citation chain whose content is itself unverified. The derivation chain therefore remains self-contained: the result is a conditional theorem whose hypothesis (MAI) is stated separately from the conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the MAI criterion being a valid relaxation of classical indexability and on the existence of a latent Markov chain that governs arm transitions. No free parameters are mentioned in the abstract. The latent state is an invented modeling entity whose independent evidence would be falsifiable regime predictions in new data.

axioms (1)

domain assumption Markov-Averaged Indexability (MAI) holds for the latent environment
Stated in the abstract as the sufficient condition under which the QWI convergence proof goes through.

invented entities (1)

latent Markov state no independent evidence
purpose: Induces nonstationary arm dynamics that switch over time without being observed
Core modeling addition that makes the environment nonstationary; no independent evidence supplied in the abstract beyond the simulator calibration.

pith-pipeline@v0.9.0 · 5495 in / 1383 out tokens · 19934 ms · 2026-05-17T23:21:42.453449+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem IV.1 … under the MAI criterion, synchronous QWI converges almost surely … to the environment-averaged optimal Q-function … (Eq. 12)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Markov-Averaged Indexability (MAI) … passive region ¯P_i(λ) … is decreasing in λ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

Michael Neely,Stochastic network optimization with application to communication and queueing systems, Morgan & Claypool Publishers, 2010

work page 2010
[2]

Link scheduling using graph neural networks,

Zhongyuan Zhao, Gunjan Verma, Chirag Rao, Ananthram Swami, and Santiago Segarra, “Link scheduling using graph neural networks,” IEEE Transactions on Wireless Communications, vol. 22, no. 6, pp. 3997–4012, 2022

work page 2022
[3]

A stochastic programming approach to perform hospital capacity assessments,

Robert L Burdett, Paul Corry, Belinda Spratt, David Cook, and Prasad Yarlagadda, “A stochastic programming approach to perform hospital capacity assessments,”Plos one, vol. 18, no. 11, pp. e0287980, 2023

work page 2023
[4]

Resource allocation and task scheduling in fog computing and internet of everything environments: A taxonomy, review, and future directions,

Bushra Jamil, Humaira Ijaz, Mohammad Shojafar, Kashif Munir, and Rajkumar Buyya, “Resource allocation and task scheduling in fog computing and internet of everything environments: A taxonomy, review, and future directions,”ACM Computing Surveys (CSUR), vol. 54, no. 11s, pp. 1–38, 2022

work page 2022
[5]

Restless bandits: Activity allocation in a changing world,

Peter Whittle, “Restless bandits: Activity allocation in a changing world,”Journal of applied probability, vol. 25, no. A, pp. 287–298, 1988

work page 1988
[6]

The complexity of optimal queuing network control,

Christos H Papadimitriou and John N Tsitsiklis, “The complexity of optimal queuing network control,”Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999

work page 1999
[7]

Conditions for indexability of restless bandits and an algorithm to compute whittle index,

Nima Akbarzadeh and Aditya Mahajan, “Conditions for indexability of restless bandits and an algorithm to compute whittle index,”Advances in Applied Probability, vol. 54, no. 4, pp. 1164–1192, 2022

work page 2022
[8]

On an index policy for restless bandits,

Richard R Weber and Gideon Weiss, “On an index policy for restless bandits,”Journal of applied probability, vol. 27, no. 3, pp. 637–648, 1990

work page 1990
[9]

Q-learning for bandit problems,

Michael O Duff, “Q-learning for bandit problems,” inMachine Learning Proceedings 1995, pp. 209–217. Elsevier, 1995

work page 1995
[10]

John Gittins, Kevin Glazebrook, and Richard Weber,Multi-armed bandit allocation indices, John Wiley & Sons, 2011

work page 2011
[11]

Towards q-learning the whittle index for restless bandits,

Jing Fu, Yoni Nazarathy, Sarat Moka, and Peter G Taylor, “Towards q-learning the whittle index for restless bandits,” in2019 Australian & New Zealand Control Conference (ANZCC). IEEE, 2019, pp. 249–254

work page 2019
[12]

A novel implementation of q-learning for the whittle index,

Lachlan J Gibson, Peter Jacko, and Yoni Nazarathy, “A novel implementation of q-learning for the whittle index,” inEAI inter- national conference on performance evaluation methodologies and tools. Springer, 2021, pp. 154–170

work page 2021
[13]

Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,

Keqin Liu and Qing Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,”IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010

work page 2010
[14]

Age of information: Whittle index for scheduling stochastic arrivals,

Yu-Pin Hsu, “Age of information: Whittle index for scheduling stochastic arrivals,” in2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 2634–2638

work page 2018
[15]

A verification theorem for threshold-indexability of real-state discounted restless bandits,

Jos´e Ni ˜no-Mora, “A verification theorem for threshold-indexability of real-state discounted restless bandits,”Mathematics of Operations Research, vol. 45, no. 2, pp. 465–496, 2020

work page 2020
[16]

Whittle index based q-learning for restless bandits with average reward,

Konstantin E Avrachenkov and Vivek S Borkar, “Whittle index based q-learning for restless bandits with average reward,”Automatica, vol. 139, pp. 110186, 2022

work page 2022
[17]

Full gradient deep reinforcement learning for average-reward criterion,

Tejas Pagare, Vivek Borkar, and Konstantin Avrachenkov, “Full gradient deep reinforcement learning for average-reward criterion,” inLearning for Dynamics and Control Conference. PMLR, 2023, pp. 235–247

work page 2023
[18]

Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation,

Guojun Xiong and Jian Li, “Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation,”Advances in Neural Information Processing Systems, vol. 36, pp. 29048–29073, 2023

work page 2023
[19]

Q-learning lagrange policies for multi-action restless bandits,

Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe, “Q-learning lagrange policies for multi-action restless bandits,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 871–881

work page 2021
[20]

Neurwin: Neural whittle index network for restless bandits via deep rl,

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, Srinivas Shakkottai, et al., “Neurwin: Neural whittle index network for restless bandits via deep rl,”Advances in Neural Information Processing Systems, vol. 34, pp. 828–839, 2021

work page 2021
[21]

Tabular and deep learning for the whittle index,

Francisco Robledo Rela ˜no, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov, “Tabular and deep learning for the whittle index,”ACM Transactions on Modeling and Performance Evaluation of Computing Systems, vol. 9, no. 3, pp. 1–21, 2024

work page 2024
[22]

Efficient resource allocation with fairness constraints in restless multi-armed bandits,

Dexun Li and Pradeep Varakantham, “Efficient resource allocation with fairness constraints in restless multi-armed bandits,” inUncertainty in Artificial Intelligence. PMLR, 2022, pp. 1158–1167

work page 2022
[23]

Online restless multi- armed bandits with long-term fairness constraints,

Shufan Wang, Guojun Xiong, and Jian Li, “Online restless multi- armed bandits with long-term fairness constraints,” inProceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 15616–15624

work page 2024
[24]

Stochastic multi-armed- bandit problem with non-stationary rewards,

Omar Besbes, Yonatan Gur, and Assaf Zeevi, “Stochastic multi-armed- bandit problem with non-stationary rewards,”Advances in neural information processing systems, vol. 27, 2014

work page 2014
[25]

Non-stationary restless multi-armed bandits with provable guarantee,

Yu-Heng Hung, Ping-Chun Hsieh, and Kai Wang, “Non-stationary restless multi-armed bandits with provable guarantee,”arXiv e-prints, pp. arXiv–2508, 2025

work page 2025
[26]

Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, and Christopher G Brinton, “Online learning of whittle indices for restless bandits with non-stationary transition kernels,”arXiv preprint arXiv:2506.18186, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Restless multi-armed bandits under exogenous global markov process,

Tomer Gafni, Michal Yemini, and Kobi Cohen, “Restless multi-armed bandits under exogenous global markov process,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 5218–5222

work page 2022
[28]

Reinforcement learning in switching non-stationary markov decision processes: Algorithms and convergence analysis,

Mohsen Amiri and Sindri Magn ´usson, “Reinforcement learning in switching non-stationary markov decision processes: Algorithms and convergence analysis,”arXiv preprint arXiv:2503.18607, 2025

work page arXiv 2025
[29]

On the convergence of td- learning on markov reward processes with hidden states,

Mohsen Amiri and Sindri Magn ´usson, “On the convergence of td- learning on markov reward processes with hidden states,” in2024 European Control Conference (ECC). IEEE, 2024, pp. 2097–2104

work page 2024
[30]

9, Springer, 2008

Vivek S Borkar and Vivek S Borkar,Stochastic approximation: a dynamical systems viewpoint, vol. 9, Springer, 2008

work page 2008
[31]

A sta- bility criterion for two timescale stochastic approximation schemes,

Chandrashekar Lakshminarayanan and Shalabh Bhatnagar, “A sta- bility criterion for two timescale stochastic approximation schemes,” Automatica, vol. 79, pp. 108–114, 2017

work page 2017
[32]

Tor Lattimore and Csaba Szepesv ´ari,Bandit algorithms, Cambridge University Press, 2020

work page 2020
[33]

Approximation algorithms for restless bandit problems,

Sudipto Guha, Kamesh Munagala, and Peng Shi, “Approximation algorithms for restless bandit problems,”Journal of the ACM (JACM), vol. 58, no. 1, pp. 1–50, 2010

work page 2010
[34]

Stochastic approximation with two time scales,

Vivek S Borkar, “Stochastic approximation with two time scales,” Systems & Control Letters, vol. 29, no. 5, pp. 291–294, 1997

work page 1997
[35]

28, Cambridge university press, 1990

Kazimierz Goebel and William A Kirk,Topics in metric fixed point theory, vol. 28, Cambridge university press, 1990

work page 1990
[36]

An extension of lasalle’s invariance principle and its application to multi-agent consensus,

Daizhan Cheng, Jinhuan Wang, and Xiaoming Hu, “An extension of lasalle’s invariance principle and its application to multi-agent consensus,”IEEE Transactions on Automatic Control, vol. 53, no. 7, pp. 1765–1770, 2008

work page 2008
[37]

The ode method for convergence of stochastic approximation and reinforcement learning,

Vivek S Borkar and Sean P Meyn, “The ode method for convergence of stochastic approximation and reinforcement learning,”SIAM Journal on Control and Optimization, vol. 38, no. 2, pp. 447–469, 2000

work page 2000
[38]

A convergence theorem for non negative almost supermartingales and some applications,

Herbert Robbins and David Siegmund, “A convergence theorem for non negative almost supermartingales and some applications,” inOptimizing methods in statistics, pp. 233–257. Elsevier, 1971

work page 1971
[39]

Recommender systems,

Paul Resnick and Hal R Varian, “Recommender systems,”Communi- cations of the ACM, vol. 40, no. 3, pp. 56–58, 1997

work page 1997
[40]

Robust recommender system: a survey and future directions,

Kaike Zhang, Qi Cao, Fei Sun, Yunfan Wu, Shuchang Tao, Huawei Shen, and Xueqi Cheng, “Robust recommender system: a survey and future directions,”ACM Computing Surveys, vol. 58, no. 1, pp. 1–38, 2025

work page 2025

[1] [1]

Michael Neely,Stochastic network optimization with application to communication and queueing systems, Morgan & Claypool Publishers, 2010

work page 2010

[2] [2]

Link scheduling using graph neural networks,

Zhongyuan Zhao, Gunjan Verma, Chirag Rao, Ananthram Swami, and Santiago Segarra, “Link scheduling using graph neural networks,” IEEE Transactions on Wireless Communications, vol. 22, no. 6, pp. 3997–4012, 2022

work page 2022

[3] [3]

A stochastic programming approach to perform hospital capacity assessments,

Robert L Burdett, Paul Corry, Belinda Spratt, David Cook, and Prasad Yarlagadda, “A stochastic programming approach to perform hospital capacity assessments,”Plos one, vol. 18, no. 11, pp. e0287980, 2023

work page 2023

[4] [4]

Resource allocation and task scheduling in fog computing and internet of everything environments: A taxonomy, review, and future directions,

Bushra Jamil, Humaira Ijaz, Mohammad Shojafar, Kashif Munir, and Rajkumar Buyya, “Resource allocation and task scheduling in fog computing and internet of everything environments: A taxonomy, review, and future directions,”ACM Computing Surveys (CSUR), vol. 54, no. 11s, pp. 1–38, 2022

work page 2022

[5] [5]

Restless bandits: Activity allocation in a changing world,

Peter Whittle, “Restless bandits: Activity allocation in a changing world,”Journal of applied probability, vol. 25, no. A, pp. 287–298, 1988

work page 1988

[6] [6]

The complexity of optimal queuing network control,

Christos H Papadimitriou and John N Tsitsiklis, “The complexity of optimal queuing network control,”Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999

work page 1999

[7] [7]

Conditions for indexability of restless bandits and an algorithm to compute whittle index,

Nima Akbarzadeh and Aditya Mahajan, “Conditions for indexability of restless bandits and an algorithm to compute whittle index,”Advances in Applied Probability, vol. 54, no. 4, pp. 1164–1192, 2022

work page 2022

[8] [8]

On an index policy for restless bandits,

Richard R Weber and Gideon Weiss, “On an index policy for restless bandits,”Journal of applied probability, vol. 27, no. 3, pp. 637–648, 1990

work page 1990

[9] [9]

Q-learning for bandit problems,

Michael O Duff, “Q-learning for bandit problems,” inMachine Learning Proceedings 1995, pp. 209–217. Elsevier, 1995

work page 1995

[10] [10]

John Gittins, Kevin Glazebrook, and Richard Weber,Multi-armed bandit allocation indices, John Wiley & Sons, 2011

work page 2011

[11] [11]

Towards q-learning the whittle index for restless bandits,

Jing Fu, Yoni Nazarathy, Sarat Moka, and Peter G Taylor, “Towards q-learning the whittle index for restless bandits,” in2019 Australian & New Zealand Control Conference (ANZCC). IEEE, 2019, pp. 249–254

work page 2019

[12] [12]

A novel implementation of q-learning for the whittle index,

Lachlan J Gibson, Peter Jacko, and Yoni Nazarathy, “A novel implementation of q-learning for the whittle index,” inEAI inter- national conference on performance evaluation methodologies and tools. Springer, 2021, pp. 154–170

work page 2021

[13] [13]

Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,

Keqin Liu and Qing Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,”IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010

work page 2010

[14] [14]

Age of information: Whittle index for scheduling stochastic arrivals,

Yu-Pin Hsu, “Age of information: Whittle index for scheduling stochastic arrivals,” in2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 2634–2638

work page 2018

[15] [15]

A verification theorem for threshold-indexability of real-state discounted restless bandits,

Jos´e Ni ˜no-Mora, “A verification theorem for threshold-indexability of real-state discounted restless bandits,”Mathematics of Operations Research, vol. 45, no. 2, pp. 465–496, 2020

work page 2020

[16] [16]

Whittle index based q-learning for restless bandits with average reward,

Konstantin E Avrachenkov and Vivek S Borkar, “Whittle index based q-learning for restless bandits with average reward,”Automatica, vol. 139, pp. 110186, 2022

work page 2022

[17] [17]

Full gradient deep reinforcement learning for average-reward criterion,

Tejas Pagare, Vivek Borkar, and Konstantin Avrachenkov, “Full gradient deep reinforcement learning for average-reward criterion,” inLearning for Dynamics and Control Conference. PMLR, 2023, pp. 235–247

work page 2023

[18] [18]

Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation,

Guojun Xiong and Jian Li, “Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation,”Advances in Neural Information Processing Systems, vol. 36, pp. 29048–29073, 2023

work page 2023

[19] [19]

Q-learning lagrange policies for multi-action restless bandits,

Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe, “Q-learning lagrange policies for multi-action restless bandits,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 871–881

work page 2021

[20] [20]

Neurwin: Neural whittle index network for restless bandits via deep rl,

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, Srinivas Shakkottai, et al., “Neurwin: Neural whittle index network for restless bandits via deep rl,”Advances in Neural Information Processing Systems, vol. 34, pp. 828–839, 2021

work page 2021

[21] [21]

Tabular and deep learning for the whittle index,

Francisco Robledo Rela ˜no, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov, “Tabular and deep learning for the whittle index,”ACM Transactions on Modeling and Performance Evaluation of Computing Systems, vol. 9, no. 3, pp. 1–21, 2024

work page 2024

[22] [22]

Efficient resource allocation with fairness constraints in restless multi-armed bandits,

Dexun Li and Pradeep Varakantham, “Efficient resource allocation with fairness constraints in restless multi-armed bandits,” inUncertainty in Artificial Intelligence. PMLR, 2022, pp. 1158–1167

work page 2022

[23] [23]

Online restless multi- armed bandits with long-term fairness constraints,

Shufan Wang, Guojun Xiong, and Jian Li, “Online restless multi- armed bandits with long-term fairness constraints,” inProceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 15616–15624

work page 2024

[24] [24]

Stochastic multi-armed- bandit problem with non-stationary rewards,

Omar Besbes, Yonatan Gur, and Assaf Zeevi, “Stochastic multi-armed- bandit problem with non-stationary rewards,”Advances in neural information processing systems, vol. 27, 2014

work page 2014

[25] [25]

Non-stationary restless multi-armed bandits with provable guarantee,

Yu-Heng Hung, Ping-Chun Hsieh, and Kai Wang, “Non-stationary restless multi-armed bandits with provable guarantee,”arXiv e-prints, pp. arXiv–2508, 2025

work page 2025

[26] [26]

Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, and Christopher G Brinton, “Online learning of whittle indices for restless bandits with non-stationary transition kernels,”arXiv preprint arXiv:2506.18186, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Restless multi-armed bandits under exogenous global markov process,

Tomer Gafni, Michal Yemini, and Kobi Cohen, “Restless multi-armed bandits under exogenous global markov process,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 5218–5222

work page 2022

[28] [28]

Reinforcement learning in switching non-stationary markov decision processes: Algorithms and convergence analysis,

Mohsen Amiri and Sindri Magn ´usson, “Reinforcement learning in switching non-stationary markov decision processes: Algorithms and convergence analysis,”arXiv preprint arXiv:2503.18607, 2025

work page arXiv 2025

[29] [29]

On the convergence of td- learning on markov reward processes with hidden states,

Mohsen Amiri and Sindri Magn ´usson, “On the convergence of td- learning on markov reward processes with hidden states,” in2024 European Control Conference (ECC). IEEE, 2024, pp. 2097–2104

work page 2024

[30] [30]

9, Springer, 2008

Vivek S Borkar and Vivek S Borkar,Stochastic approximation: a dynamical systems viewpoint, vol. 9, Springer, 2008

work page 2008

[31] [31]

A sta- bility criterion for two timescale stochastic approximation schemes,

Chandrashekar Lakshminarayanan and Shalabh Bhatnagar, “A sta- bility criterion for two timescale stochastic approximation schemes,” Automatica, vol. 79, pp. 108–114, 2017

work page 2017

[32] [32]

Tor Lattimore and Csaba Szepesv ´ari,Bandit algorithms, Cambridge University Press, 2020

work page 2020

[33] [33]

Approximation algorithms for restless bandit problems,

Sudipto Guha, Kamesh Munagala, and Peng Shi, “Approximation algorithms for restless bandit problems,”Journal of the ACM (JACM), vol. 58, no. 1, pp. 1–50, 2010

work page 2010

[34] [34]

Stochastic approximation with two time scales,

Vivek S Borkar, “Stochastic approximation with two time scales,” Systems & Control Letters, vol. 29, no. 5, pp. 291–294, 1997

work page 1997

[35] [35]

28, Cambridge university press, 1990

Kazimierz Goebel and William A Kirk,Topics in metric fixed point theory, vol. 28, Cambridge university press, 1990

work page 1990

[36] [36]

An extension of lasalle’s invariance principle and its application to multi-agent consensus,

Daizhan Cheng, Jinhuan Wang, and Xiaoming Hu, “An extension of lasalle’s invariance principle and its application to multi-agent consensus,”IEEE Transactions on Automatic Control, vol. 53, no. 7, pp. 1765–1770, 2008

work page 2008

[37] [37]

The ode method for convergence of stochastic approximation and reinforcement learning,

Vivek S Borkar and Sean P Meyn, “The ode method for convergence of stochastic approximation and reinforcement learning,”SIAM Journal on Control and Optimization, vol. 38, no. 2, pp. 447–469, 2000

work page 2000

[38] [38]

A convergence theorem for non negative almost supermartingales and some applications,

Herbert Robbins and David Siegmund, “A convergence theorem for non negative almost supermartingales and some applications,” inOptimizing methods in statistics, pp. 233–257. Elsevier, 1971

work page 1971

[39] [39]

Recommender systems,

Paul Resnick and Hal R Varian, “Recommender systems,”Communi- cations of the ACM, vol. 40, no. 3, pp. 56–58, 1997

work page 1997

[40] [40]

Robust recommender system: a survey and future directions,

Kaike Zhang, Qi Cao, Fei Sun, Yunfan Wu, Shuchang Tao, Huawei Shen, and Xueqi Cheng, “Robust recommender system: a survey and future directions,”ACM Computing Surveys, vol. 58, no. 1, pp. 1–38, 2025

work page 2025