Restless Bandits with Individual Penalty Constraints: Near-Optimal Indices and Deep Reinforcement Learning

I-Hong Hou; Nida Zamir

arxiv: 2604.04101 · v3 · submitted 2026-04-05 · 💻 cs.LG

Restless Bandits with Individual Penalty Constraints: Near-Optimal Indices and Deep Reinforcement Learning

Nida Zamir , I-Hong Hou This is my paper

Pith reviewed 2026-05-13 16:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords restless multi-armed banditsindex policiespenalty constraintsasymptotic optimalitydeep reinforcement learningresource allocationwireless networks

0 comments

The pith

A Penalty-Optimal Whittle index policy achieves asymptotic optimality for restless bandits while satisfying per-arm penalty constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that restless multi-armed bandits can be solved under distinct per-arm constraints such as energy budgets or activation limits by using an index computed independently for each arm. The index depends only on that arm's transition kernel and its own penalty function, so it can be calculated once offline and stays fixed no matter how many arms or how much total resource is available. This property yields a policy that is asymptotically optimal in the large-system limit and that meets every individual constraint by construction. A deep reinforcement learning procedure is supplied to estimate the same indices when the kernels are initially unknown.

Core claim

The POW index for an arm is the Lagrange multiplier that equates the long-run average reward to the long-run average penalty cost in the single-arm relaxed problem; the resulting index policy is asymptotically optimal for the original constrained multi-arm problem as the number of arms tends to infinity while exactly respecting all individual penalty thresholds.

What carries the argument

The Penalty-Optimal Whittle (POW) index, defined per arm from its transition kernel and penalty function alone, that ranks arms for activation at each time step.

If this is right

Indices can be precomputed offline for each user independently of system size.
The policy scales to arbitrarily many users without recomputing indices or solving a joint optimization.
All individual constraints are satisfied at every time step by construction of the index rule.
A deep RL variant learns the indices from samples when kernels are unknown.
Simulation performance remains near-optimal across wireless scheduling, energy management, and age-of-information problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-arm decomposition may allow distributed or parallel index computation in large networks.
If the asymptotic guarantee holds approximately at moderate scale, the same indices could be used directly in systems with dozens rather than thousands of arms.
The method could be extended to slowly time-varying kernels by periodic re-learning of the indices.
Constraint types beyond energy and age-of-information, such as fairness or delay bounds, fit the same index framework with only a change in the penalty function.

Load-bearing premise

Transition kernels and penalty functions are known or accurately learnable and the system contains enough arms for the optimality gap of the index policy to vanish.

What would settle it

A numerical example or analysis in which the POW policy's average reward gap fails to approach zero as the number of arms grows without bound, or in which at least one per-arm penalty constraint is violated for large but finite populations.

Figures

Figures reproduced from arXiv: 2604.04101 by I-Hong Hou, Nida Zamir.

**Figure 3.** Figure 3: Average reward and average constraint violation [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of learning policies for remote sensing. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 6.** Figure 6: Comparison of learning policies for throughput [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of an user only depends on the user's transition kernel and penalty constraints, and remains invariable to system-wide features such as the number of users present and the amount of resource available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only has near-optimal performance but also significantly outperforms other existing policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The POW index decouples per-arm constraints from system size in restless bandits, which is a practical advance if the asymptotic optimality holds up under the paper's conditions.

read the letter

The paper's main move is to define a Penalty-Optimal Whittle index that each arm computes from its own transition kernel and private penalty function alone. This index does not change with the total number of arms or the global resource budget, so you can tabulate it offline for each user type and then just sort at decision time. That property is new relative to standard Whittle constructions for RMABs and directly addresses wireless settings where individual energy or AoI limits differ across users. They also supply a deep RL procedure to estimate the index when the kernels are unknown, and they report that the resulting policy meets the constraints while performing well in simulations across several network scenarios. Those are the concrete pieces a practitioner could pick up and try. The asymptotic optimality claim is the part that needs the most scrutiny. The argument presumably uses a fluid or concentration limit in the many-arm regime, but the abstract and stress-test note leave open whether uniform ergodicity or Lipschitz conditions on the penalties are stated and verified. If those regularity requirements are missing or only hold in special cases, the optimality gap may not vanish as claimed. The simulation section is described only at a high level, with no mention of variance across runs or the exact competing policies, so it is difficult to judge how large the reported gains actually are. This work is aimed at researchers who already work on constrained RMABs or dynamic wireless resource allocation. A reader who needs an index that respects per-arm limits without re-solving a global MDP each time the user count changes will find usable ideas here. The paper shows clear engagement with the existing RMAB literature and ships both a theoretical claim and a learning algorithm, so it is worth sending to a serious referee who can check the proof conditions and the experimental controls. I would recommend peer review rather than desk rejection, provided the authors are asked to make the regularity assumptions explicit.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Penalty-Optimal Whittle (POW) index policy for restless multi-armed bandits with per-arm penalty constraints (e.g., energy or AoI limits). It asserts that each POW index depends solely on an arm's own transition kernel and penalty function, can be computed offline, and that the resulting policy is asymptotically optimal in the many-arm limit while satisfying all individual constraints. A deep RL algorithm is introduced to learn the indices online, and simulations are presented to show near-optimal performance across applications.

Significance. If the asymptotic optimality result holds under appropriate conditions, the work provides a tractable, decentralized solution to constrained RMAB problems that avoids global parameter fitting and scales to large systems. This is relevant for wireless resource allocation with heterogeneous constraints. The DRL component adds practical utility when kernels are unknown, and the per-arm index structure is a conceptual strength.

major comments (2)

[Abstract / Theoretical Analysis] Abstract and theoretical sections: the central claim of asymptotic optimality and constraint satisfaction is asserted but the proof is not detailed. No explicit statement appears of the required regularity conditions (e.g., uniform ergodicity or mixing rates on the transition kernels, Lipschitz continuity of the penalty functions) needed for the fluid-limit or concentration arguments to establish a vanishing optimality gap.
[POW Index Definition] POW index definition: the precise optimization problem solved to obtain the index value for each arm is not formulated. Without this, it is unclear how indexability of the constrained per-arm MDP is established and how the claimed independence from system-wide parameters (number of arms, total resource) is guaranteed.

minor comments (1)

[Simulations] Simulation results lack quantitative error bars, confidence intervals, or statistical significance tests against baselines, weakening the empirical support for the 'near-optimal' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

read point-by-point responses

Referee: [Abstract / Theoretical Analysis] Abstract and theoretical sections: the central claim of asymptotic optimality and constraint satisfaction is asserted but the proof is not detailed. No explicit statement appears of the required regularity conditions (e.g., uniform ergodicity or mixing rates on the transition kernels, Lipschitz continuity of the penalty functions) needed for the fluid-limit or concentration arguments to establish a vanishing optimality gap.

Authors: We agree that the proof of asymptotic optimality and constraint satisfaction should be presented with explicit regularity conditions. In the revised manuscript, we will add a dedicated subsection with the full proof outline, stating the required assumptions including uniform ergodicity of the transition kernels and Lipschitz continuity of the penalty functions to support the fluid-limit and concentration arguments. revision: yes
Referee: [POW Index Definition] POW index definition: the precise optimization problem solved to obtain the index value for each arm is not formulated. Without this, it is unclear how indexability of the constrained per-arm MDP is established and how the claimed independence from system-wide parameters (number of arms, total resource) is guaranteed.

Authors: We will revise the manuscript to explicitly formulate the optimization problem defining the POW index for each arm. This formulation will establish indexability of the constrained per-arm MDP and demonstrate that the index depends solely on the individual arm's transition kernel and penalty function, remaining independent of system-wide parameters such as the number of arms or total resources. revision: yes

Circularity Check

0 steps flagged

POW index defined per-arm; asymptotic optimality follows standard many-arm limit without reduction to inputs

full rationale

The POW index is explicitly constructed from each arm's own transition kernel and individual penalty constraints alone, remaining invariant to the number of users or total resources. This per-arm definition enables offline computation and directly feeds into the index policy whose asymptotic optimality is proved in the many-arm regime. No equation reduces a global prediction to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation is therefore self-contained; the only minor risk noted is the precise numerical optimization used to obtain each index value, which does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard MDP modeling of each arm plus the existence of a well-defined index that separates local constraints from global allocation.

axioms (1)

domain assumption Each arm evolves as a finite-state Markov decision process whose transition kernel is known or learnable.
Invoked when defining the POW index from the per-arm kernel and penalty function.

invented entities (1)

Penalty-Optimal Whittle index no independent evidence
purpose: A scalar priority score for each constrained arm that enables decoupled scheduling.
Newly defined quantity whose computation is claimed to be independent of global system parameters.

pith-pipeline@v0.9.0 · 5506 in / 1280 out tokens · 24096 ms · 2026-05-13T16:55:11.450868+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a new Penalty-Optimal Whittle (POW) index policy... defined as I_n(s_n) := sup {λ | π*_n(s_n, λ, μ*_n(λ)) = 1}
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1: If π_Rel has global attractor... then y_Rel = y_Ind and π_Rel(y_Ind) = π_Ind(y_Ind)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1]

Konstantin Avrachenkov, Vivek S Borkar, and Pratik Shah. 2024. Lagrangian index policy for restless bandits with average reward.arXiv preprint arXiv:2412.12641 (2024)

work page arXiv 2024
[2]

Vivek S Borkar, Gaurav S Kasbekar, Sarath Pattathil, and Priyesh Y Shetty. 2017. Opportunistic scheduling as restless bandits.IEEE Transactions on Control of Network Systems5, 4 (2017), 1952–1961

work page 2017
[3]

Archana Bura, Ushasi Ghosh, Dinesh Bharadia, and Srinivas Shakkottai. 2024. Windex: Realtime neural whittle indexing for scalable service guarantees in nextg cellular networks.arXiv preprint arXiv:2406.01888(2024)

work page arXiv 2024
[4]

Guozhi Chen, Yuchao Chen, Jintao Wang, and Jian Song. 2023. Minimizing age of information in downlink wireless networks with time-varying channels and peak power constraint.IEEE Transactions on Vehicular Technology72, 7 (2023), 9058–9068

work page 2023
[5]

Nicolas Gast, Bruno Gaujal, and Kimang Khun. 2023. Testing indexability and computing Whittle and Gittins index in subcubic time.Mathematical Methods of Operations Research97, 3 (2023), 391–436

work page 2023
[6]

Nicolas Gast, Bruno Gaujal, and Chen Yan. 2023. Exponential asymptotic opti- mality of whittle index policy.Queueing Systems104, 1 (2023), 107–150

work page 2023
[7]

Christine Herlihy, Aviva Prins, Aravind Srinivasan, and John P Dickerson. 2023. Planning to fairly allocate: Probabilistic fairness in the restless bandit setting. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 732–740

work page 2023
[8]

S Amir Hosseini and Shivendra S Panwar. 2017. Restless streaming bandits: Scheduling scalable video in wireless networks. In2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 620–628

work page 2017
[9]

Yu-Pin Hsu. 2018. Age of information: Whittle index for scheduling stochastic arrivals. In2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2634–2638

work page 2018
[10]

Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. 2016. Fairness in learning: Classic and contextual bandits.Advances in neural informa- tion processing systems29 (2016)

work page 2016
[11]

Igor Kadota, Abhishek Sinha, and Eytan Modiano. 2018. Optimizing age of information in wireless networks with throughput constraints. InIEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 1844–1852

work page 2018
[12]

Maialen Larrañaga, Urtzi Ayesta, and Ina Maria Verloop. 2015. Asymptotically optimal index policies for an abandonment queue with convex holding cost. Queueing systems81, 2 (2015), 99–169

work page 2015
[13]

Dexun Li and Pradeep Varakantham. 2022. Efficient resource allocation with fairness constraints in restless multi-armed bandits. InUncertainty in Artificial Intelligence. PMLR, 1158–1167

work page 2022
[14]

Dexun Li and Pradeep Varakantham. 2023. Avoiding starvation of arms in restless multi-armed bandit. (2023)

work page 2023
[15]

Andrea Lodi, Philippe Olivier, Gilles Pesant, and Sriram Sankaranarayanan. 2024. Fairness over time in dynamic resource allocation with an application in health- care.Mathematical Programming203, 1 (2024), 285–318

work page 2024
[16]

Yi Mao and Andrew Perrault. 2024. Time-Constrained Restless Multi-Armed Bandits with Applications to City Service Scheduling.. InAAMAS. 2375–2377

work page 2024
[17]

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, and Srinivas Shakkottai

work page
[18]

Advances in Neural Information Processing Systems34 (2021), 828–839

NeurWIN: Neural Whittle index network for restless bandits via deep RL. Advances in Neural Information Processing Systems34 (2021), 828–839

work page 2021
[19]

Khaled Nakhleh and I-Hong Hou. 2022. DeepTOP: Deep threshold-optimal policy for MDPs and RMABs.Advances in Neural Information Processing Systems35 (2022), 28734–28746

work page 2022
[20]

Michael J Neely. 2013. A Lyapunov optimization approach to repeated stochastic games. In2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 1082–1089

work page 2013
[21]

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov. 2024. Tabular and deep learning for the whittle index.ACM Transactions on Modeling and Performance Evaluation of Computing Systems9, 3 (2024), 1–21

work page 2024
[22]

Md Kamran Chowdhury Shisher, Bo Ji, I-Hong Hou, and Yin Sun. 2023. Learning and communications co-design for remote inference systems: Feature length se- lection and transmission scheduling.IEEE Journal on Selected Areas in Information Theory4 (2023), 524–538

work page 2023
[23]

Santosh Kumar Singh, Vivek S Borkar, and Gaurav S Kasbekar. 2022. User association in dense mmWave networks as restless bandits.IEEE Transactions on Vehicular Technology71, 7 (2022), 7919–7929

work page 2022
[24]

Haoyue Tang, Jintao Wang, Linqi Song, and Jian Song. 2020. Minimizing age of information with power constraints: Multi-user opportunistic scheduling in multi- state time-varying channels.IEEE Journal on Selected Areas in Communications 38, 5 (2020), 854–868

work page 2020
[25]

Kai Wang, Lily Xu, Aparna Taneja, and Milind Tambe. 2023. Optimistic whittle index policy: Online learning for restless bandits. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 10131–10139

work page 2023
[26]

Shufan Wang, Guojun Xiong, and Jian Li. 2024. Online restless multi-armed bandits with long-term fairness constraints. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 15616–15624

work page 2024
[27]

Richard R Weber and Gideon Weiss. 1990. On an index policy for restless bandits. Journal of applied probability27, 3 (1990), 637–648

work page 1990
[28]

Peter Whittle. 1988. Restless bandits: Activity allocation in a changing world. Journal of applied probability25, A (1988), 287–298

work page 1988
[29]

Jun Xu and Chengcheng Guo. 2019. Scheduling periodic real-time traffic in lossy wireless networks as restless multi-armed bandit.IEEE Wireless Communications Letters8, 4 (2019), 1129–1132

work page 2019
[30]

Yihan Zou, Kwang Taik Kim, Xiaojun Lin, and Mung Chiang. 2021. Minimizing age-of-information in heterogeneous multi-channel systems: A new partial-index approach. InProceedings of the twenty-second international symposium on the- ory, algorithmic foundations, and protocol design for mobile networks and mobile computing. 11–20. Restless Bandits with Indivi...

work page 2021

[1] [1]

Konstantin Avrachenkov, Vivek S Borkar, and Pratik Shah. 2024. Lagrangian index policy for restless bandits with average reward.arXiv preprint arXiv:2412.12641 (2024)

work page arXiv 2024

[2] [2]

Vivek S Borkar, Gaurav S Kasbekar, Sarath Pattathil, and Priyesh Y Shetty. 2017. Opportunistic scheduling as restless bandits.IEEE Transactions on Control of Network Systems5, 4 (2017), 1952–1961

work page 2017

[3] [3]

Archana Bura, Ushasi Ghosh, Dinesh Bharadia, and Srinivas Shakkottai. 2024. Windex: Realtime neural whittle indexing for scalable service guarantees in nextg cellular networks.arXiv preprint arXiv:2406.01888(2024)

work page arXiv 2024

[4] [4]

Guozhi Chen, Yuchao Chen, Jintao Wang, and Jian Song. 2023. Minimizing age of information in downlink wireless networks with time-varying channels and peak power constraint.IEEE Transactions on Vehicular Technology72, 7 (2023), 9058–9068

work page 2023

[5] [5]

Nicolas Gast, Bruno Gaujal, and Kimang Khun. 2023. Testing indexability and computing Whittle and Gittins index in subcubic time.Mathematical Methods of Operations Research97, 3 (2023), 391–436

work page 2023

[6] [6]

Nicolas Gast, Bruno Gaujal, and Chen Yan. 2023. Exponential asymptotic opti- mality of whittle index policy.Queueing Systems104, 1 (2023), 107–150

work page 2023

[7] [7]

Christine Herlihy, Aviva Prins, Aravind Srinivasan, and John P Dickerson. 2023. Planning to fairly allocate: Probabilistic fairness in the restless bandit setting. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 732–740

work page 2023

[8] [8]

S Amir Hosseini and Shivendra S Panwar. 2017. Restless streaming bandits: Scheduling scalable video in wireless networks. In2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 620–628

work page 2017

[9] [9]

Yu-Pin Hsu. 2018. Age of information: Whittle index for scheduling stochastic arrivals. In2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2634–2638

work page 2018

[10] [10]

Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. 2016. Fairness in learning: Classic and contextual bandits.Advances in neural informa- tion processing systems29 (2016)

work page 2016

[11] [11]

Igor Kadota, Abhishek Sinha, and Eytan Modiano. 2018. Optimizing age of information in wireless networks with throughput constraints. InIEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 1844–1852

work page 2018

[12] [12]

Maialen Larrañaga, Urtzi Ayesta, and Ina Maria Verloop. 2015. Asymptotically optimal index policies for an abandonment queue with convex holding cost. Queueing systems81, 2 (2015), 99–169

work page 2015

[13] [13]

Dexun Li and Pradeep Varakantham. 2022. Efficient resource allocation with fairness constraints in restless multi-armed bandits. InUncertainty in Artificial Intelligence. PMLR, 1158–1167

work page 2022

[14] [14]

Dexun Li and Pradeep Varakantham. 2023. Avoiding starvation of arms in restless multi-armed bandit. (2023)

work page 2023

[15] [15]

Andrea Lodi, Philippe Olivier, Gilles Pesant, and Sriram Sankaranarayanan. 2024. Fairness over time in dynamic resource allocation with an application in health- care.Mathematical Programming203, 1 (2024), 285–318

work page 2024

[16] [16]

Yi Mao and Andrew Perrault. 2024. Time-Constrained Restless Multi-Armed Bandits with Applications to City Service Scheduling.. InAAMAS. 2375–2377

work page 2024

[17] [17]

Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, and Srinivas Shakkottai

work page

[18] [18]

Advances in Neural Information Processing Systems34 (2021), 828–839

NeurWIN: Neural Whittle index network for restless bandits via deep RL. Advances in Neural Information Processing Systems34 (2021), 828–839

work page 2021

[19] [19]

Khaled Nakhleh and I-Hong Hou. 2022. DeepTOP: Deep threshold-optimal policy for MDPs and RMABs.Advances in Neural Information Processing Systems35 (2022), 28734–28746

work page 2022

[20] [20]

Michael J Neely. 2013. A Lyapunov optimization approach to repeated stochastic games. In2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 1082–1089

work page 2013

[21] [21]

Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov. 2024. Tabular and deep learning for the whittle index.ACM Transactions on Modeling and Performance Evaluation of Computing Systems9, 3 (2024), 1–21

work page 2024

[22] [22]

Md Kamran Chowdhury Shisher, Bo Ji, I-Hong Hou, and Yin Sun. 2023. Learning and communications co-design for remote inference systems: Feature length se- lection and transmission scheduling.IEEE Journal on Selected Areas in Information Theory4 (2023), 524–538

work page 2023

[23] [23]

Santosh Kumar Singh, Vivek S Borkar, and Gaurav S Kasbekar. 2022. User association in dense mmWave networks as restless bandits.IEEE Transactions on Vehicular Technology71, 7 (2022), 7919–7929

work page 2022

[24] [24]

Haoyue Tang, Jintao Wang, Linqi Song, and Jian Song. 2020. Minimizing age of information with power constraints: Multi-user opportunistic scheduling in multi- state time-varying channels.IEEE Journal on Selected Areas in Communications 38, 5 (2020), 854–868

work page 2020

[25] [25]

Kai Wang, Lily Xu, Aparna Taneja, and Milind Tambe. 2023. Optimistic whittle index policy: Online learning for restless bandits. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 10131–10139

work page 2023

[26] [26]

Shufan Wang, Guojun Xiong, and Jian Li. 2024. Online restless multi-armed bandits with long-term fairness constraints. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 15616–15624

work page 2024

[27] [27]

Richard R Weber and Gideon Weiss. 1990. On an index policy for restless bandits. Journal of applied probability27, 3 (1990), 637–648

work page 1990

[28] [28]

Peter Whittle. 1988. Restless bandits: Activity allocation in a changing world. Journal of applied probability25, A (1988), 287–298

work page 1988

[29] [29]

Jun Xu and Chengcheng Guo. 2019. Scheduling periodic real-time traffic in lossy wireless networks as restless multi-armed bandit.IEEE Wireless Communications Letters8, 4 (2019), 1129–1132

work page 2019

[30] [30]

Yihan Zou, Kwang Taik Kim, Xiaojun Lin, and Mung Chiang. 2021. Minimizing age-of-information in heterogeneous multi-channel systems: A new partial-index approach. InProceedings of the twenty-second international symposium on the- ory, algorithmic foundations, and protocol design for mobile networks and mobile computing. 11–20. Restless Bandits with Indivi...

work page 2021