pith. sign in

arxiv: 2604.04101 · v3 · submitted 2026-04-05 · 💻 cs.LG

Restless Bandits with Individual Penalty Constraints: Near-Optimal Indices and Deep Reinforcement Learning

Pith reviewed 2026-05-13 16:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords restless multi-armed banditsindex policiespenalty constraintsasymptotic optimalitydeep reinforcement learningresource allocationwireless networks
0
0 comments X

The pith

A Penalty-Optimal Whittle index policy achieves asymptotic optimality for restless bandits while satisfying per-arm penalty constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that restless multi-armed bandits can be solved under distinct per-arm constraints such as energy budgets or activation limits by using an index computed independently for each arm. The index depends only on that arm's transition kernel and its own penalty function, so it can be calculated once offline and stays fixed no matter how many arms or how much total resource is available. This property yields a policy that is asymptotically optimal in the large-system limit and that meets every individual constraint by construction. A deep reinforcement learning procedure is supplied to estimate the same indices when the kernels are initially unknown.

Core claim

The POW index for an arm is the Lagrange multiplier that equates the long-run average reward to the long-run average penalty cost in the single-arm relaxed problem; the resulting index policy is asymptotically optimal for the original constrained multi-arm problem as the number of arms tends to infinity while exactly respecting all individual penalty thresholds.

What carries the argument

The Penalty-Optimal Whittle (POW) index, defined per arm from its transition kernel and penalty function alone, that ranks arms for activation at each time step.

If this is right

  • Indices can be precomputed offline for each user independently of system size.
  • The policy scales to arbitrarily many users without recomputing indices or solving a joint optimization.
  • All individual constraints are satisfied at every time step by construction of the index rule.
  • A deep RL variant learns the indices from samples when kernels are unknown.
  • Simulation performance remains near-optimal across wireless scheduling, energy management, and age-of-information problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-arm decomposition may allow distributed or parallel index computation in large networks.
  • If the asymptotic guarantee holds approximately at moderate scale, the same indices could be used directly in systems with dozens rather than thousands of arms.
  • The method could be extended to slowly time-varying kernels by periodic re-learning of the indices.
  • Constraint types beyond energy and age-of-information, such as fairness or delay bounds, fit the same index framework with only a change in the penalty function.

Load-bearing premise

Transition kernels and penalty functions are known or accurately learnable and the system contains enough arms for the optimality gap of the index policy to vanish.

What would settle it

A numerical example or analysis in which the POW policy's average reward gap fails to approach zero as the number of arms grows without bound, or in which at least one per-arm penalty constraint is violated for large but finite populations.

Figures

Figures reproduced from arXiv: 2604.04101 by I-Hong Hou, Nida Zamir.

Figure 2
Figure 2. Figure 2: Comparison of learning policies for throughput [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average reward and average constraint violation [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of learning policies for remote sensing. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of learning policies for throughput [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of an user only depends on the user's transition kernel and penalty constraints, and remains invariable to system-wide features such as the number of users present and the amount of resource available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only has near-optimal performance but also significantly outperforms other existing policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Penalty-Optimal Whittle (POW) index policy for restless multi-armed bandits with per-arm penalty constraints (e.g., energy or AoI limits). It asserts that each POW index depends solely on an arm's own transition kernel and penalty function, can be computed offline, and that the resulting policy is asymptotically optimal in the many-arm limit while satisfying all individual constraints. A deep RL algorithm is introduced to learn the indices online, and simulations are presented to show near-optimal performance across applications.

Significance. If the asymptotic optimality result holds under appropriate conditions, the work provides a tractable, decentralized solution to constrained RMAB problems that avoids global parameter fitting and scales to large systems. This is relevant for wireless resource allocation with heterogeneous constraints. The DRL component adds practical utility when kernels are unknown, and the per-arm index structure is a conceptual strength.

major comments (2)
  1. [Abstract / Theoretical Analysis] Abstract and theoretical sections: the central claim of asymptotic optimality and constraint satisfaction is asserted but the proof is not detailed. No explicit statement appears of the required regularity conditions (e.g., uniform ergodicity or mixing rates on the transition kernels, Lipschitz continuity of the penalty functions) needed for the fluid-limit or concentration arguments to establish a vanishing optimality gap.
  2. [POW Index Definition] POW index definition: the precise optimization problem solved to obtain the index value for each arm is not formulated. Without this, it is unclear how indexability of the constrained per-arm MDP is established and how the claimed independence from system-wide parameters (number of arms, total resource) is guaranteed.
minor comments (1)
  1. [Simulations] Simulation results lack quantitative error bars, confidence intervals, or statistical significance tests against baselines, weakening the empirical support for the 'near-optimal' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] Abstract and theoretical sections: the central claim of asymptotic optimality and constraint satisfaction is asserted but the proof is not detailed. No explicit statement appears of the required regularity conditions (e.g., uniform ergodicity or mixing rates on the transition kernels, Lipschitz continuity of the penalty functions) needed for the fluid-limit or concentration arguments to establish a vanishing optimality gap.

    Authors: We agree that the proof of asymptotic optimality and constraint satisfaction should be presented with explicit regularity conditions. In the revised manuscript, we will add a dedicated subsection with the full proof outline, stating the required assumptions including uniform ergodicity of the transition kernels and Lipschitz continuity of the penalty functions to support the fluid-limit and concentration arguments. revision: yes

  2. Referee: [POW Index Definition] POW index definition: the precise optimization problem solved to obtain the index value for each arm is not formulated. Without this, it is unclear how indexability of the constrained per-arm MDP is established and how the claimed independence from system-wide parameters (number of arms, total resource) is guaranteed.

    Authors: We will revise the manuscript to explicitly formulate the optimization problem defining the POW index for each arm. This formulation will establish indexability of the constrained per-arm MDP and demonstrate that the index depends solely on the individual arm's transition kernel and penalty function, remaining independent of system-wide parameters such as the number of arms or total resources. revision: yes

Circularity Check

0 steps flagged

POW index defined per-arm; asymptotic optimality follows standard many-arm limit without reduction to inputs

full rationale

The POW index is explicitly constructed from each arm's own transition kernel and individual penalty constraints alone, remaining invariant to the number of users or total resources. This per-arm definition enables offline computation and directly feeds into the index policy whose asymptotic optimality is proved in the many-arm regime. No equation reduces a global prediction to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation is therefore self-contained; the only minor risk noted is the precise numerical optimization used to obtain each index value, which does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard MDP modeling of each arm plus the existence of a well-defined index that separates local constraints from global allocation.

axioms (1)
  • domain assumption Each arm evolves as a finite-state Markov decision process whose transition kernel is known or learnable.
    Invoked when defining the POW index from the per-arm kernel and penalty function.
invented entities (1)
  • Penalty-Optimal Whittle index no independent evidence
    purpose: A scalar priority score for each constrained arm that enables decoupled scheduling.
    Newly defined quantity whose computation is claimed to be independent of global system parameters.

pith-pipeline@v0.9.0 · 5506 in / 1280 out tokens · 24096 ms · 2026-05-13T16:55:11.450868+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Konstantin Avrachenkov, Vivek S Borkar, and Pratik Shah. 2024. Lagrangian index policy for restless bandits with average reward.arXiv preprint arXiv:2412.12641 (2024)

  2. [2]

    Vivek S Borkar, Gaurav S Kasbekar, Sarath Pattathil, and Priyesh Y Shetty. 2017. Opportunistic scheduling as restless bandits.IEEE Transactions on Control of Network Systems5, 4 (2017), 1952–1961

  3. [3]

    Archana Bura, Ushasi Ghosh, Dinesh Bharadia, and Srinivas Shakkottai. 2024. Windex: Realtime neural whittle indexing for scalable service guarantees in nextg cellular networks.arXiv preprint arXiv:2406.01888(2024)

  4. [4]

    Guozhi Chen, Yuchao Chen, Jintao Wang, and Jian Song. 2023. Minimizing age of information in downlink wireless networks with time-varying channels and peak power constraint.IEEE Transactions on Vehicular Technology72, 7 (2023), 9058–9068

  5. [5]

    Nicolas Gast, Bruno Gaujal, and Kimang Khun. 2023. Testing indexability and computing Whittle and Gittins index in subcubic time.Mathematical Methods of Operations Research97, 3 (2023), 391–436

  6. [6]

    Nicolas Gast, Bruno Gaujal, and Chen Yan. 2023. Exponential asymptotic opti- mality of whittle index policy.Queueing Systems104, 1 (2023), 107–150

  7. [7]

    Christine Herlihy, Aviva Prins, Aravind Srinivasan, and John P Dickerson. 2023. Planning to fairly allocate: Probabilistic fairness in the restless bandit setting. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 732–740

  8. [8]

    S Amir Hosseini and Shivendra S Panwar. 2017. Restless streaming bandits: Scheduling scalable video in wireless networks. In2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 620–628

  9. [9]

    Yu-Pin Hsu. 2018. Age of information: Whittle index for scheduling stochastic arrivals. In2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2634–2638

  10. [10]

    Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. 2016. Fairness in learning: Classic and contextual bandits.Advances in neural informa- tion processing systems29 (2016)

  11. [11]

    Igor Kadota, Abhishek Sinha, and Eytan Modiano. 2018. Optimizing age of information in wireless networks with throughput constraints. InIEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 1844–1852

  12. [12]

    Maialen Larrañaga, Urtzi Ayesta, and Ina Maria Verloop. 2015. Asymptotically optimal index policies for an abandonment queue with convex holding cost. Queueing systems81, 2 (2015), 99–169

  13. [13]

    Dexun Li and Pradeep Varakantham. 2022. Efficient resource allocation with fairness constraints in restless multi-armed bandits. InUncertainty in Artificial Intelligence. PMLR, 1158–1167

  14. [14]

    Dexun Li and Pradeep Varakantham. 2023. Avoiding starvation of arms in restless multi-armed bandit. (2023)

  15. [15]

    Andrea Lodi, Philippe Olivier, Gilles Pesant, and Sriram Sankaranarayanan. 2024. Fairness over time in dynamic resource allocation with an application in health- care.Mathematical Programming203, 1 (2024), 285–318

  16. [16]

    Yi Mao and Andrew Perrault. 2024. Time-Constrained Restless Multi-Armed Bandits with Applications to City Service Scheduling.. InAAMAS. 2375–2377

  17. [17]

    Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I Hou, and Srinivas Shakkottai

  18. [18]

    Advances in Neural Information Processing Systems34 (2021), 828–839

    NeurWIN: Neural Whittle index network for restless bandits via deep RL. Advances in Neural Information Processing Systems34 (2021), 828–839

  19. [19]

    Khaled Nakhleh and I-Hong Hou. 2022. DeepTOP: Deep threshold-optimal policy for MDPs and RMABs.Advances in Neural Information Processing Systems35 (2022), 28734–28746

  20. [20]

    Michael J Neely. 2013. A Lyapunov optimization approach to repeated stochastic games. In2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 1082–1089

  21. [21]

    Francisco Robledo Relaño, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov. 2024. Tabular and deep learning for the whittle index.ACM Transactions on Modeling and Performance Evaluation of Computing Systems9, 3 (2024), 1–21

  22. [22]

    Md Kamran Chowdhury Shisher, Bo Ji, I-Hong Hou, and Yin Sun. 2023. Learning and communications co-design for remote inference systems: Feature length se- lection and transmission scheduling.IEEE Journal on Selected Areas in Information Theory4 (2023), 524–538

  23. [23]

    Santosh Kumar Singh, Vivek S Borkar, and Gaurav S Kasbekar. 2022. User association in dense mmWave networks as restless bandits.IEEE Transactions on Vehicular Technology71, 7 (2022), 7919–7929

  24. [24]

    Haoyue Tang, Jintao Wang, Linqi Song, and Jian Song. 2020. Minimizing age of information with power constraints: Multi-user opportunistic scheduling in multi- state time-varying channels.IEEE Journal on Selected Areas in Communications 38, 5 (2020), 854–868

  25. [25]

    Kai Wang, Lily Xu, Aparna Taneja, and Milind Tambe. 2023. Optimistic whittle index policy: Online learning for restless bandits. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 10131–10139

  26. [26]

    Shufan Wang, Guojun Xiong, and Jian Li. 2024. Online restless multi-armed bandits with long-term fairness constraints. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 15616–15624

  27. [27]

    Richard R Weber and Gideon Weiss. 1990. On an index policy for restless bandits. Journal of applied probability27, 3 (1990), 637–648

  28. [28]

    Peter Whittle. 1988. Restless bandits: Activity allocation in a changing world. Journal of applied probability25, A (1988), 287–298

  29. [29]

    Jun Xu and Chengcheng Guo. 2019. Scheduling periodic real-time traffic in lossy wireless networks as restless multi-armed bandit.IEEE Wireless Communications Letters8, 4 (2019), 1129–1132

  30. [30]

    Yihan Zou, Kwang Taik Kim, Xiaojun Lin, and Mung Chiang. 2021. Minimizing age-of-information in heterogeneous multi-channel systems: A new partial-index approach. InProceedings of the twenty-second international symposium on the- ory, algorithmic foundations, and protocol design for mobile networks and mobile computing. 11–20. Restless Bandits with Indivi...