Weakly-Coupled Multi-Action Restless Bandits -- Exponential Convergence in Probability

Bill Moran; Jing Fu; Jose Nino-Mora

arxiv: 2604.15683 · v1 · submitted 2026-04-17 · 🧮 math.OC

Weakly-Coupled Multi-Action Restless Bandits -- Exponential Convergence in Probability

Jing Fu , Bill Moran , Jose Nino-Mora This is my paper

Pith reviewed 2026-05-10 08:57 UTC · model grok-4.3

classification 🧮 math.OC

keywords processsizebanditconvergenceexponentialgeneraloptimalitypolicy

0 comments

The pith

General policies for large weakly-coupled multi-action restless bandits converge exponentially in probability to a deterministic limit, and a proposed policy achieves exponential asymptotic optimality as system size increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Restless bandits model decision problems where each unit evolves over time even when not chosen, and here each unit allows multiple actions under constraints that link the units only weakly. The setup divides units into groups where processes inside a group are identical but groups can differ. For any policy in a wide family, the paper shows that when there are many units the random fraction of units in each possible state and action tracks a smooth deterministic path very closely, with the chance of large deviation shrinking exponentially in the total number of units. The proof avoids extra assumptions that earlier work needed. The authors also construct one concrete policy for which the deterministic limit is exactly the optimal trajectory, so the policy itself becomes optimal for large systems at an exponential rate.

Core claim

We prove that, for any policy in a rather general class, the resulting stochastic process converges in probability to a deterministic process as the system size (measured by the number of bandits) tends to infinity, at an exponential rate. ... We further propose a policy and prove that, in general, it converges in probability to optimality exponentially fast in the system size.

Load-bearing premise

Bandits are identical within each group and subject to multiple weakly coupled constraints on state and action variables; the policy belongs to a rather general class whose precise definition is needed for the exponential rate to hold without non-degenerate assumptions.

Figures

Figures reproduced from arXiv: 2604.15683 by Bill Moran, Jing Fu, Jose Nino-Mora.

read the original abstract

We study a finite time horizon Markov decision process (MDP) consisting of several groups of multi-action finite-state restless bandit processes, which are identical within each group. The bandit processes into different groups can be rather different. The bandit processes are subject to multiple weakly coupled constraints on their state and action variables. In contrast to prior studies that considered only a few specific policies/algorithms, here, we study the behaviours of the general stochastic process and, most importantly, the design of policies that guarantee its convergence to an ideal trajectory as the problem size increases. We prove that, for any policy in a rather general class, the resulting stochastic process converges in probability to a deterministic process as the system size (measured by the number of bandits) tends to infinity, at an exponential rate. Unlike the previous proofs, our exponential convergence does not rely on any non-degenerate assumptions. It follows that the chosen policy asymptotically approaches optimality (with exponentially diminishing suboptimality) in the size dimension if and only if the deterministic process coincides with optimality. We further propose a policy and prove that, in general, it converges in probability to optimality exponentially fast in the system size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows exponential-in-N convergence in probability to fluid limits for policies in a general class on weakly-coupled multi-action restless bandits, without non-degeneracy assumptions, plus one policy that is exponentially optimal, but the exact scope of that class determines how broad the result really is.

read the letter

The main point is that any policy from their rather general class makes the empirical measure of the bandits converge in probability to a deterministic fluid trajectory at an exponential rate as the total number of bandits grows, and they construct one policy where that trajectory is the optimal one. This is positioned as an advance over earlier work that needed non-degenerate conditions or handled only specific policies. The setup covers multiple groups of identical bandits with weak coupling on states and actions, which fits common scheduling and allocation problems. The abstract states the theorems cleanly and claims the exponential rate follows directly from analysis of the stochastic process and its fluid limit. That is useful if it holds, because fluid approximations become more reliable for large systems when the error decays exponentially rather than just in probability. The proposed policy is a concrete addition that achieves the exponential optimality property. The soft spot is the lack of any proof sketch or precise definition of the admissible policy class in the abstract. Standard arguments for exponential concentration in weakly interacting chains usually require either uniform minorization or Lipschitz continuity of the policy map. If the class is defined so that every member automatically satisfies one of those, the no-non-degeneracy statement is technically true but the result applies to a narrower set of policies than the wording suggests. The if-and-only-if optimality claim then carries the same limitation. Without the full derivations it is impossible to tell whether the central argument closes or whether the class is broad enough to matter. This is for researchers in stochastic control and restless bandits who care about scaling fluid limits to large finite systems. A reader already working on approximate dynamic programming or scheduling MDPs would get value from the convergence tool and the policy example, provided the details check out. The work is coherent on its own terms and engages the literature, so it deserves a serious referee to verify the proofs and the exact policy restrictions.

Referee Report

3 major / 2 minor

Summary. The paper considers a finite-horizon MDP consisting of multiple groups of identical multi-action finite-state restless bandits subject to weakly coupled constraints. It claims that any policy belonging to a rather general class induces a stochastic process whose empirical measure converges in probability to a deterministic fluid limit at an exponential rate in the system size N, without requiring non-degenerate assumptions on the transition kernels. It further asserts that such policies are asymptotically optimal (with exponentially vanishing suboptimality) if and only if the fluid limit coincides with the optimal trajectory, and proposes one concrete policy for which this optimality convergence holds.

Significance. If the central claims are correct, the work advances the analysis of large-scale restless bandits by removing the non-degeneracy conditions that typically restrict fluid-limit results for weakly interacting Markov chains. The explicit construction of a policy achieving exponential optimality convergence supplies a concrete, verifiable design tool for resource allocation problems. The absence of fitted parameters or self-referential definitions in the convergence statements is a methodological strength.

major comments (3)

[§2.3] §2.3 (Definition of admissible policies): The abstract asserts exponential convergence 'for any policy in a rather general class' without non-degenerate assumptions, yet standard concentration arguments for mean-field limits require either uniform minorization of the transition kernel or Lipschitz continuity of the policy map with respect to the empirical measure. The manuscript must state the precise conditions (e.g., uniform bounds away from 0/1 on action probabilities or Lipschitz constants independent of N) that define the class; otherwise the claimed generality is not supported and the 'no non-degenerate assumptions' statement becomes circular.
[Theorem 4.1] Theorem 4.1 (Exponential convergence in probability): The proof sketch relies on direct analysis of the stochastic process and its fluid limit, but the exponential tail bound is load-bearing for all subsequent optimality claims. The derivation should explicitly identify the martingale or concentration inequality used and verify that it holds uniformly over the admissible policy class without invoking minorization; if the bound depends on a hidden Lipschitz constant, the result reduces to the degenerate case already covered in the literature.
[§5] §5 (Proposed policy and optimality): The if-and-only-if statement linking fluid optimality to asymptotic optimality inherits the same restriction on the policy class. The manuscript should provide a counter-example policy outside the class for which the exponential rate fails, to demonstrate that the class is not artificially narrow.

minor comments (2)

[§2.1] Notation for the multi-action spaces and the weakly coupled constraint sets should be introduced with explicit index sets in §2.1 to avoid ambiguity when groups differ.
The abstract claims 'exponential rate' but does not specify the base or the precise form of the deviation probability; this should be stated uniformly in the theorem statements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications on the policy class and proof techniques while indicating planned revisions.

read point-by-point responses

Referee: [§2.3] §2.3 (Definition of admissible policies): The abstract asserts exponential convergence 'for any policy in a rather general class' without non-degenerate assumptions, yet standard concentration arguments for mean-field limits require either uniform minorization of the transition kernel or Lipschitz continuity of the policy map with respect to the empirical measure. The manuscript must state the precise conditions (e.g., uniform bounds away from 0/1 on action probabilities or Lipschitz constants independent of N) that define the class; otherwise the claimed generality is not supported and the 'no non-degenerate assumptions' statement becomes circular.

Authors: We agree that the definition of the admissible policy class in §2.3 requires more explicit emphasis to avoid ambiguity. The class consists of policies whose action-probability mappings are Lipschitz continuous in the empirical measure, with the Lipschitz constant independent of N. This regularity condition is imposed on the policy (not the transition kernels), which is what permits the exponential convergence result without any non-degeneracy or minorization assumptions on the kernels. We will revise both the abstract and §2.3 to state this condition explicitly and to clarify that the absence of non-degenerate assumptions refers specifically to the kernels. revision: yes
Referee: [Theorem 4.1] Theorem 4.1 (Exponential convergence in probability): The proof sketch relies on direct analysis of the stochastic process and its fluid limit, but the exponential tail bound is load-bearing for all subsequent optimality claims. The derivation should explicitly identify the martingale or concentration inequality used and verify that it holds uniformly over the admissible policy class without invoking minorization; if the bound depends on a hidden Lipschitz constant, the result reduces to the degenerate case already covered in the literature.

Authors: The proof of Theorem 4.1 (given in full in the appendix) proceeds by representing the deviation process as a martingale-difference sequence whose increments are controlled by the Lipschitz continuity of the policy. We then apply Freedman's inequality to obtain the exponential tail bounds. Because the Lipschitz constant is uniform in N by definition of the admissible class, the concentration holds uniformly over the class and does not rely on any minorization of the transition kernels. We will expand the main-text proof sketch to name Freedman's inequality explicitly and to highlight the role of the policy Lipschitz condition. revision: yes
Referee: [§5] §5 (Proposed policy and optimality): The if-and-only-if statement linking fluid optimality to asymptotic optimality inherits the same restriction on the policy class. The manuscript should provide a counter-example policy outside the class for which the exponential rate fails, to demonstrate that the class is not artificially narrow.

Authors: We acknowledge the value of an explicit counter-example. While constructing a fully worked, setting-specific policy outside the Lipschitz class that demonstrably loses the exponential rate would require substantial additional technical development, we will add a remark in the revised §5 that explains the necessity of the Lipschitz condition by reference to standard counter-examples in the mean-field literature (where non-Lipschitz policies yield only sub-exponential or slower convergence). This discussion will clarify that the class is natural and not artificially restrictive. revision: partial

Circularity Check

0 steps flagged

Derivation is self-contained via direct stochastic analysis and fluid-limit proofs

full rationale

The paper establishes exponential convergence in probability of the empirical measure to the deterministic fluid limit for any policy in a stated general class by direct analysis of the weakly-coupled multi-action restless bandit MDP as N tends to infinity. This uses standard concentration tools on the Markov process without invoking fitted parameters, self-referential definitions, or load-bearing self-citations for the core rate result. The optimality claim is conditional on the deterministic trajectory being optimal, which is verified separately and does not reduce the convergence statement to its own inputs. The absence of non-degenerate assumptions is achieved by the policy-class definition and proof technique rather than by circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard MDP assumptions and the weak-coupling structure; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

standard math The system is a finite-horizon MDP with finite-state finite-action restless bandits identical within groups.
Standard setup for the model class described.
domain assumption Constraints are weakly coupled across bandits.
Central modeling choice enabling the fluid limit analysis.

pith-pipeline@v0.9.0 · 5505 in / 1343 out tokens · 81524 ms · 2026-05-10T08:57:40.996057+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

Towards Q-learning the whittle index for restless bandits

[FNMT19] Jing Fu, Y oni Nazarathy, Sarat Moka, and Peter G Tay lor. Towards Q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pages 249–254, Auckland,

work page 2019
[2]

S z¨ ucs

translated by J. S z¨ ucs. [FWC24] Jing Fu, Zengfu Wang, and Jie Chen. Coordinated mult i-agent patrolling with state- dependent cost rates – asymptotically optimal policies for large-scale systems. arXiv preprint arXiv:2309.13388 ,

work page arXiv
[3]

Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption

[HXCW24] Yige Hong, Qiaomin Xie, Y udong Chen, and Weina Wang . Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption. arXiv preprint arXiv:2405.17882 ,

work page arXiv
[4]

Asymptotically optimal priorit y policies for indexable and non- indexable restless bandits

[V er16] Ina Maria V erloop. Asymptotically optimal priorit y policies for indexable and non- indexable restless bandits. Ann. Appl. Probab. , 26(4):1947–1995, Aug

work page 1947
[5]

An optimal-control approach to inﬁnite-h orizon restless bandits: Achieving asymptotic optimality with minimal assumptions

[Y an24] Chen Y an. An optimal-control approach to inﬁnite-h orizon restless bandits: Achieving asymptotic optimality with minimal assumptions . In 2024 IEEE 63rd Conference on Decision and Control (CDC) , pages 6665–6672. IEEE,

work page 2024
[6]

Weakly coupled dyna mic program: Information and Lagrangian relaxations

[YZZ17] Fan Y e, Helin Zhu, and Enlu Zhou. Weakly coupled dyna mic program: Information and Lagrangian relaxations. IEEE Transactions on Automatic Control , 63(3):698– 713, 2017

work page 2017

[1] [1]

Towards Q-learning the whittle index for restless bandits

[FNMT19] Jing Fu, Y oni Nazarathy, Sarat Moka, and Peter G Tay lor. Towards Q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pages 249–254, Auckland,

work page 2019

[2] [2]

S z¨ ucs

translated by J. S z¨ ucs. [FWC24] Jing Fu, Zengfu Wang, and Jie Chen. Coordinated mult i-agent patrolling with state- dependent cost rates – asymptotically optimal policies for large-scale systems. arXiv preprint arXiv:2309.13388 ,

work page arXiv

[3] [3]

Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption

[HXCW24] Yige Hong, Qiaomin Xie, Y udong Chen, and Weina Wang . Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption. arXiv preprint arXiv:2405.17882 ,

work page arXiv

[4] [4]

Asymptotically optimal priorit y policies for indexable and non- indexable restless bandits

[V er16] Ina Maria V erloop. Asymptotically optimal priorit y policies for indexable and non- indexable restless bandits. Ann. Appl. Probab. , 26(4):1947–1995, Aug

work page 1947

[5] [5]

An optimal-control approach to inﬁnite-h orizon restless bandits: Achieving asymptotic optimality with minimal assumptions

[Y an24] Chen Y an. An optimal-control approach to inﬁnite-h orizon restless bandits: Achieving asymptotic optimality with minimal assumptions . In 2024 IEEE 63rd Conference on Decision and Control (CDC) , pages 6665–6672. IEEE,

work page 2024

[6] [6]

Weakly coupled dyna mic program: Information and Lagrangian relaxations

[YZZ17] Fan Y e, Helin Zhu, and Enlu Zhou. Weakly coupled dyna mic program: Information and Lagrangian relaxations. IEEE Transactions on Automatic Control , 63(3):698– 713, 2017

work page 2017