Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning

Andrew Perrault; Xueqiao Peng

arxiv: 2603.19397 · v2 · submitted 2026-03-19 · 💻 cs.LG


Pith reviewed 2026-05-15 07:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · resource allocation · outbreak control · non-pharmaceutical interventions · restless multi-armed bandits · SARS-CoV-2 · hierarchical learning · public health


The pith

A hierarchical reinforcement learning system allocates limited testing and quarantine resources across multiple asynchronous outbreak clusters more effectively than bandit or heuristic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses how to distribute scarce non-pharmaceutical interventions such as diagnostic testing and quarantine when multiple infection clusters emerge at different times and must share a fixed resource budget. It models the task as a constrained restless multi-armed bandit problem and solves it with a two-level reinforcement learning approach: a global controller learns a continuous multiplier that sets the overall spending rate, while local policies rank individuals inside each cluster according to their estimated marginal value. In an agent-based simulator of SARS-CoV-2 transmission, this framework delivers 20 to 30 percent better outbreak control than RMAB-inspired and simple heuristic baselines across many system sizes and testing budgets. The same structure scales to forty simultaneously active clusters while producing decisions faster than direct application of the bandit formulation.

Core claim

The authors formulate multi-cluster NPI allocation as a constrained restless multi-armed bandit and show that a hierarchical reinforcement learning framework solves it: a global controller learns a continuous cost multiplier that regulates total resource demand, and a generalized local policy estimates the marginal value of allocating resources to specific individuals within each cluster. When evaluated in a realistic agent-based SARS-CoV-2 simulator with dynamically arriving clusters, the resulting policies outperform RMAB-inspired and heuristic baselines by 20-30 percent in outbreak control effectiveness and remain scalable to forty concurrent clusters.
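For orientation, the constrained formulation and its Lagrangian relaxation can be written in a generic textbook form. The symbols below ($r_n$, $c$, $B$, $\gamma$, $\lambda$, $T$) are assumptions chosen for illustration and are not taken from the paper's notation; the learned cost multiplier described above plays a role loosely analogous to $\lambda$.

```latex
\begin{align}
\max_{\pi}\;& \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} \sum_{n=1}^{N} r_n(s_{n,t}, a_{n,t})\Big]
\quad \text{s.t.} \quad \sum_{n=1}^{N} c(a_{n,t}) \le B \;\; \text{for all } t, \\
L(\lambda) =\;& \max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} \Big(\sum_{n=1}^{N} \big(r_n(s_{n,t}, a_{n,t}) - \lambda\, c(a_{n,t})\big) + \lambda B\Big)\Big],
\qquad \lambda \ge 0.
\end{align}
```

For any fixed $\lambda \ge 0$, $L(\lambda)$ upper-bounds the constrained optimum, and the inner maximization decouples across arms, which is what makes per-arm value estimates usable for ranking.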

What carries the argument

Hierarchical reinforcement learning framework consisting of a global controller that outputs a continuous resource cost multiplier and a local policy that computes marginal value of allocation within each cluster.
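A minimal Python sketch of how the two levels could interact in a single decision step. Everything named here is an assumption for illustration, not the paper's implementation: `Candidate`, `allocate`, a unit test cost, and the rule that a candidate requests a test only when its estimated marginal value exceeds the multiplier times the cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cluster_id: int
    person_id: int
    marginal_value: float  # hypothetical local-policy estimate of the benefit of testing


def allocate(candidates, multiplier, budget, unit_cost=1.0):
    """Filter by the scaled cost, rank by marginal value, and cap at the budget."""
    # Global level (assumed rule): a larger multiplier makes a test look more
    # expensive, so fewer candidates clear the threshold and demand shrinks.
    willing = [c for c in candidates if c.marginal_value > multiplier * unit_cost]
    # Local level: rank the remaining candidates by estimated marginal value.
    ranked = sorted(willing, key=lambda c: c.marginal_value, reverse=True)
    # Hard constraint: never exceed the shared budget.
    return ranked[:budget]


pool = [
    Candidate(0, 0, 0.9),
    Candidate(0, 1, 0.2),
    Candidate(1, 0, 0.6),
    Candidate(1, 1, 0.4),
]
chosen_loose = allocate(pool, multiplier=0.1, budget=3)  # budget is the binding limit
chosen_tight = allocate(pool, multiplier=0.5, budget=3)  # multiplier is the binding limit
```

With a low multiplier the budget cap decides who is tested; with a high multiplier demand falls below the budget on its own, which is the behavior a learned multiplier would be expected to tune.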

If this is right

Resource allocation policies can be learned that respect a shared budget while handling asynchronous cluster arrivals and heterogeneous risk levels.
The method scales to forty simultaneously active clusters, the largest setting tested, while remaining effective.
Decisions are produced faster than with the RMAB-inspired baseline.
Outbreak size and duration can be reduced under tight testing budgets compared with standard baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The global-local split may apply to other constrained public-health decisions such as hospital-bed or vaccine allocation across multiple sites.
Embedding the framework in live surveillance systems could allow continuous re-learning as new clusters appear.
Robustness checks against alternative disease models or compliance assumptions would clarify how far the performance gains transfer.

Load-bearing premise

The agent-based SARS-CoV-2 simulator with dynamically arriving clusters accurately captures real-world transmission, compliance, and resource constraints.

What would settle it

Applying the learned policies to data from an actual multi-cluster outbreak or to an independent, differently calibrated epidemiological model and checking whether the 20-30 percent improvement in control effectiveness is reproduced.

Figures

Figures reproduced from arXiv: 2603.19397 by Andrew Perrault, Xueqiao Peng.

**Figure 1.** Overview of the proposed hierarchical RL framework for multi-cluster outbreak control. A global PPO controller adjusts a shared

Original abstract

Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particularly in early outbreak stages. In real-world public health settings, resources must be allocated across multiple outbreak clusters that emerge asynchronously, vary in size and risk, and compete for a shared resource budget. Here, a cluster corresponds to a group of close contacts generated by a single infected index case. Thus, decisions must be made under uncertainty and heterogeneous demands, while respecting operational constraints. We formulate this problem as a constrained restless multi-armed bandit and propose a hierarchical reinforcement learning framework. A global controller learns a continuous action cost multiplier that adjusts global resource demand, while a generalized local policy estimates the marginal value of allocating resources to individuals within each cluster. We evaluate the proposed framework in a realistic agent-based simulator of SARS-CoV-2 with dynamically arriving clusters. Across a wide range of system scales and testing budgets, our method consistently outperforms RMAB-inspired and heuristic baselines, improving outbreak control effectiveness by 20%-30%. Experiments on up to 40 concurrently active clusters further demonstrate that the hierarchical framework is highly scalable and enables faster decision-making than the RMAB-inspired method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hierarchical RL for NPI allocation across clusters shows 20-30% simulation gains but rests on an unvalidated custom simulator.


This paper presents a hierarchical reinforcement learning method for allocating scarce testing and quarantine resources across multiple asynchronously arriving outbreak clusters. It frames the task as a constrained restless multi-armed bandit and splits the problem into a global controller that learns a continuous cost multiplier plus local policies that estimate marginal value per cluster. The approach is evaluated in an agent-based SARS-CoV-2 simulator with dynamically arriving clusters and reports consistent 20-30% better outbreak control than RMAB baselines and heuristics across scales and budgets, while scaling to 40 concurrent clusters without major slowdowns. The separation of global resource tuning from local allocation decisions is a reasonable way to handle shared constraints and heterogeneous demands. The results appear stable in the reported experiments and address a real operational need in early multi-cluster response. The main limitation is that all performance numbers come from one custom simulator whose parameters are not calibrated to real outbreak data and receive no sensitivity checks on transmission rates, compliance, or cluster arrival patterns. The abstract also omits details on statistical significance, seed variance, or exact baseline code, so the size of the reported edge is hard to judge. This work would interest researchers applying RL to public-health resource problems. It deserves peer review to inspect the simulator mechanics and experimental rigor, even though it will probably require added validation experiments before acceptance.

Referee Report

3 major / 2 minor

Summary. The paper formulates multi-cluster NPI resource allocation as a constrained restless multi-armed bandit and introduces a hierarchical RL method with a global controller for continuous cost multipliers and local policies for per-cluster marginal value estimation. It reports that this approach yields 20-30% better outbreak control than RMAB-inspired and heuristic baselines in an agent-based SARS-CoV-2 simulator with dynamically arriving clusters, while remaining scalable to 40 concurrent clusters.

Significance. If the reported gains hold under rigorous statistical controls and the simulator dynamics prove transferable, the hierarchical framework would offer a practical, scalable tool for early-stage outbreak resource allocation under uncertainty. The end-to-end training in an external simulator and explicit handling of asynchronous cluster arrivals are strengths, but the absence of real-data calibration or sensitivity analysis limits immediate policy relevance.

major comments (3)

[Abstract / Evaluation] Abstract and evaluation description: the central claim of consistent 20-30% gains over baselines provides no information on statistical significance, variance across random seeds, number of trials, or exact baseline implementations (e.g., how the RMAB-inspired method is solved). This omission makes it impossible to assess whether the reported improvement is robust or an artifact of simulator stochasticity.
[Simulator and Experiments] Simulator description: no calibration to real outbreak data, no sensitivity analysis on transmission probability, compliance rates, cluster-size distributions, or quarantine efficacy is reported. If these parameters deviate from reality (e.g., under-modeling stochastic fade-out), the learned policy advantage may not transfer, undermining the claim that the method improves real-world outbreak control.
[Experiments] Scalability experiments: while the paper states the framework handles up to 40 clusters and enables faster decisions than the RMAB baseline, no quantitative timing results, memory scaling, or ablation on the hierarchical decomposition are provided to support the scalability assertion.

minor comments (2)

[Method] Notation for the continuous action cost multiplier and the local policy's marginal-value estimator should be defined more explicitly with symbols and update rules to aid reproducibility.
[Method] The abstract mentions 'generalized local policy' without clarifying whether it is a single shared network or per-cluster; this should be stated clearly in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of statistical rigor, simulator validity, and scalability. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.


Referee: [Abstract / Evaluation] Abstract and evaluation description: the central claim of consistent 20-30% gains over baselines provides no information on statistical significance, variance across random seeds, number of trials, or exact baseline implementations (e.g., how the RMAB-inspired method is solved). This omission makes it impossible to assess whether the reported improvement is robust or an artifact of simulator stochasticity.

Authors: We agree that statistical details are necessary to substantiate the performance claims. In the revised manuscript, we will report results aggregated over 50 independent random seeds, including means, standard deviations, and 95% confidence intervals for the 20-30% gains. We will add paired t-tests or Wilcoxon tests with p-values to demonstrate statistical significance. We will also expand the baseline description to specify that the RMAB-inspired method employs a Lagrangian relaxation solved via linear programming at each epoch, with the exact relaxation parameter tuning procedure.
revision: yes

Referee: [Simulator and Experiments] Simulator description: no calibration to real outbreak data, no sensitivity analysis on transmission probability, compliance rates, cluster-size distributions, or quarantine efficacy is reported. If these parameters deviate from reality (e.g., under-modeling stochastic fade-out), the learned policy advantage may not transfer, undermining the claim that the method improves real-world outbreak control.

Authors: We acknowledge that the simulator uses literature-derived parameters rather than direct calibration to a specific real-world dataset, which is a limitation for immediate policy transfer. In revision, we will add a dedicated sensitivity analysis section varying transmission probability by ±20%, compliance rates from 0.6 to 0.9, cluster-size distributions, and quarantine efficacy, showing that the hierarchical method retains its advantage across these ranges. Full calibration to proprietary outbreak data is not feasible in this study due to data access constraints; we will explicitly note this limitation and frame the work as a simulation-based proof of concept.
revision: partial

Referee: [Experiments] Scalability experiments: while the paper states the framework handles up to 40 clusters and enables faster decisions than the RMAB baseline, no quantitative timing results, memory scaling, or ablation on the hierarchical decomposition are provided to support the scalability assertion.

Authors: We will augment the experiments with quantitative scalability metrics. The revised version will include plots of average decision time per step versus number of active clusters (5 to 40), peak memory usage scaling, and direct wall-clock comparisons against the RMAB baseline. We will also add an ablation study contrasting the full hierarchical controller against a flat (non-hierarchical) policy variant to isolate the contribution of the decomposition to both performance and computational efficiency.
revision: yes
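The seed-level reporting promised in the first response could look roughly like the following sketch. It is illustrative only: the scores are synthetic placeholders rather than results from the paper, the function name `paired_summary` is hypothetical, and the interval uses a normal approximation from the standard library instead of a t quantile or a Wilcoxon test.

```python
import math
import random
import statistics


def paired_summary(method_scores, baseline_scores, confidence=0.95):
    """Mean paired difference with a normal-approximation confidence interval."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    # Two-sided z quantile (about 1.96 for a 95% interval).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean_diff, (mean_diff - z * std_err, mean_diff + z * std_err)


# Synthetic placeholder scores for 50 seeds; these are NOT results from the paper.
rng = random.Random(0)
baseline = [rng.gauss(1.0, 0.1) for _ in range(50)]
method = [b + rng.gauss(0.25, 0.05) for b in baseline]

mean_diff, (low, high) = paired_summary(method, baseline)
print(f"mean paired difference: {mean_diff:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

Pairing by seed removes variance that both methods share from the simulator's randomness, so an interval that excludes zero is more informative than comparing two unpaired means.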

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained


The paper formulates a constrained restless multi-armed bandit problem and introduces a hierarchical RL architecture (global cost multiplier + local marginal-value policy) whose training and evaluation occur entirely inside an external agent-based SARS-CoV-2 simulator. No equation or claim reduces a reported prediction to a fitted parameter by construction, no load-bearing uniqueness theorem is imported via self-citation, and the 20-30% improvement figures are empirical simulation outcomes on held-out cluster-arrival scenarios rather than algebraic identities. Minor simulator-parameter choices exist but are not presented as predictions, satisfying the criteria for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL convergence assumptions, the fidelity of the SARS-CoV-2 agent-based simulator, and the modeling choice that clusters are independent except for the shared resource budget. No new physical entities or ad-hoc constants are introduced beyond typical RL hyperparameters.

axioms (2)

domain assumption: The agent-based simulator faithfully reproduces SARS-CoV-2 transmission, cluster generation, and compliance dynamics.
  Invoked implicitly by using simulation performance as the primary evidence of effectiveness.
standard math: Standard policy-gradient or actor-critic convergence guarantees apply to the hierarchical training procedure.
  Required for the learned global and local policies to be stable.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


reward defined as $-(S_1 + \alpha_2 S_2 + \alpha_3 S_3)/N$ … $\alpha_3^{\text{active}} = m_t \cdot \alpha_3^{\text{true}}$
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


Lagrangian relaxation L(λ) … global controller learns continuous cost multiplier

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1] [Alamo et al., 2021] Teodoro Alamo, Daniel G Reina, Pablo Millán Gata, Victor M Preciado, and Giulia Giordano. Data-driven methods for present and future pandemics: Monitoring, modelling and managing. Annual Reviews in Control, 52:448–464, 2021.
[2] [Amann et al., 2020] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20(1):310, 2020.
[3] [Argyle et al., 2023] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
[4] [Avrachenkov and Borkar, 2022] Konstantin E Avrachenkov and Vivek S Borkar. Whittle index based Q-learning for restless bandits with average reward. Automatica, 139:110186, 2022.
[5] [Barto and Mahadevan, 2003] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[6] [Bertsekas, 1997] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1997.
[7] [Biswas et al., 2021] Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021.
[8] [Brauer et al., 2017] Fred Brauer, Carlos Castillo-Chavez, and Zhilan Feng. Mathematical epidemiology. Springer, 2017.
[9] [Center, 1987] Ohio Supercomputer Center. Ohio Supercomputer Center, 1987.
[10] [Funk et al., 2010] Sebastian Funk, Marcel Salathé, and Vincent AA Jansen. Modelling the influence of human behaviour on the spread of infectious diseases: a review. Journal of the Royal Society Interface, 7(50):1247–1256, 2010.
[11] [Glazebrook et al., 2006] Kevin D Glazebrook, Diego Ruiz-Hernandez, and Christopher Kirkbride. Some indexable families of restless bandit problems. Advances in Applied Probability, 38(3):643–672, 2006.
[12] [Holzinger et al., 2019] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1312, 2019.
[13] [Huang et al., 2022] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João GM Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022.
[14] [Killian et al., 2021] Jackson A Killian, Andrew Perrault, and Milind Tambe. Beyond "to act or not to act": Fast Lagrangian approaches to general multi-action restless bandits. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 710–718, 2021.
[15] [Kompella et al., 2020] Varun Kompella, Roberto Capobianco, Stacy Jong, Jonathan Browne, Spencer Fox, Lauren Meyers, Peter Wurman, and Peter Stone. Reinforcement learning for optimization of COVID-19 mitigation policies. arXiv preprint arXiv:2010.10560, 2020.
[16] [Liu and Zhao, 2010] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
[17] [Mao and Perrault, 2026] Yi Mao and Andrew Perrault. Optimizing urban service allocation with time-constrained restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 39025–39032, 2026.
[18] [Mate et al., 2020] Aditya Mate, Jackson Killian, Haifeng Xu, Andrew Perrault, and Milind Tambe. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639–15650, 2020.
[19] [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[20] [Nachum et al., 2018] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
[21] [Nakhleh et al., 2021] Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I-Hong Hou, and Srinivas Shakkottai. NeurWIN: Neural Whittle index network for restless bandits via deep RL. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 828–839. Curran Asso...
[22] [Neely, 2010] Michael J Neely. Stochastic network optimization with application to resource allocation and queueing. Synthesis Lectures on Communication Networks, 3:1–211, 2010.
[23] [Nowzari et al., 2016] Cameron Nowzari, Victor M Preciado, and George J Pappas. Analysis and control of epidemics: A survey of spreading processes on complex networks. IEEE Control Systems Magazine, 36(1):26–46, 2016.
[24] [Park et al., 2023] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
[25] [Peng et al., 2023] Xueqiao Peng, Jiaqi Xu, Xi Chen, Dinh Song An Nguyen, and Andrew Perrault. Using reinforcement learning for multi-objective cluster-level optimization of non-pharmaceutical interventions for infectious disease. In Machine Learning for Health (ML4H), pages 445–460. PMLR, 2023.
[26] [Probert et al., 2016] William JM Probert, Katriona Shea, Christopher J Fonnesbeck, Michael C Runge, Tim E Carpenter, Salome Dürr, M Graeme Garner, Neil Harvey, Mark A Stevenson, and Colleen T Webb. Decision-making for foot-and-mouth disease control: objectives matter. Epidemics, 15:10–19, 2016.
[27] [Rahmandad et al., 2021] Hazhir Rahmandad, Tse Yang Lim, and John Sterman. Behavioral dynamics of COVID-19: estimating underreporting, multiple waves, and adherence fatigue across 92 nations. System Dynamics Review, 37(1):5–31, 2021.
[28] [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[29] [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[30] [Verelst et al., 2016] Frederik Verelst, Lander Willem, and Philippe Beutels. Behavioural change models for infectious disease transmission: a systematic review (2010–2015). Journal of The Royal Society Interface, 13(125):20160820, 2016.
[31] [Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549. PMLR, 2017.
[32] [Weber and Weiss, 1990] Richard R Weber and Gideon Weiss. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990.
[33] [Whittle, 1988] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
[34] [Yang et al., 2018] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018.
[35] as the local decision module. In this framework, each cluster is controlled by a Deep Q-Network (DQN) that makes individual-level testing decisions under partial observability. Our implementation extends this framework to support adaptation across different testing cost regimes without retraining. Supervised Learning Encoder: Because true infection sta...
[36] For an active cluster, this vector encodes its size (normalized by the maximum cluster size), its age relative to the episode length, and short-term histories of testing activity, symptom prevalence, and positive test outcomes over the previous three timesteps. These quantities are normalized by cluster size to ensure comparability across clusters of ...
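The cluster-level features described in [36] can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `cluster_features`, the argument names, the zero-padding for clusters younger than three timesteps, and the feature ordering are all hypothetical, since the source text is truncated.

```python
def cluster_features(size, max_size, age, episode_length,
                     tests_hist, symptoms_hist, positives_hist, window=3):
    """Build a fixed-length feature vector for one active cluster."""

    def recent(history):
        # Keep the last `window` counts, left-padding with zeros for young
        # clusters (an assumption), then divide by cluster size so clusters
        # of different sizes are comparable.
        padded = [0] * max(0, window - len(history)) + list(history[-window:])
        return [count / size for count in padded]

    return (
        [size / max_size, age / episode_length]
        + recent(tests_hist)
        + recent(symptoms_hist)
        + recent(positives_hist)
    )


features = cluster_features(
    size=10, max_size=40, age=5, episode_length=30,
    tests_hist=[2, 4, 1, 3],   # only the last three entries are used
    symptoms_hist=[1, 2],      # shorter than the window, so it is padded
    positives_hist=[0, 1, 1],
)
```

The vector has 2 + 3 × 3 = 11 entries regardless of cluster size or age, which is what lets a single global controller consume any number of clusters with a shared input format.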
[37] Quarantine decisions follow a threshold policy adopted from Peng et al. [2023], while testing decisions are controlled by the learned policy.

Each individual is represented by a fixed-dimensional observation vector encoding epidemiological belief features, symptom history, testing history, and

Algorithm 1: Global Q-Ranking Policy
Input: Active clusters $A_t$, local observations $\{o_{n,i,t}\}$, global budget $B$
Output: Executed actions $\{a_{n,i,t}\}$
1: $C \leftarrow \emptyset$
2: for all $n \in A_t$ do
3:   for all individual $i$ in cluster $n$ do
4...
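Algorithm 1 is cut off after its third line in the source, so the following Python sketch fills in one plausible reading rather than the paper's actual procedure. The assumptions are: each individual has two Q-values (do not test, test), candidates are scored by the difference between them, and the top `budget` positive-scoring candidates are tested. The function name `global_q_ranking` and the dictionary layout are hypothetical.

```python
def global_q_ranking(q_values, budget):
    """Rank individuals across all clusters and test the top `budget` of them.

    q_values maps (cluster, individual) -> (q_no_test, q_test).
    Returns a dict mapping the same keys to 1 (test) or 0 (no test).
    """
    candidates = []  # plays the role of the candidate set C in the listing
    for (cluster, individual), (q_no_test, q_test) in q_values.items():
        advantage = q_test - q_no_test  # assumed ranking score
        if advantage > 0:
            candidates.append((advantage, cluster, individual))
    # Highest estimated gain first; the global budget caps how many are tested.
    candidates.sort(reverse=True)
    tested = {(c, i) for _, c, i in candidates[:budget]}
    return {key: int(key in tested) for key in q_values}


example_q = {
    (0, 0): (0.1, 0.5),
    (0, 1): (0.3, 0.2),
    (1, 0): (0.0, 0.9),
    (1, 1): (0.2, 0.4),
}
actions = global_q_ranking(example_q, budget=2)
```

Because the ranking pools candidates from every cluster, a large low-risk cluster cannot crowd out a small high-risk one simply by having more members.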
[38] for additional details and justifications.

B.2 Training Details

We train a hierarchical reinforcement learning system consisting of a generalized Transformer-based Deep Q-Network

Parameter | Value
Incubation period | Lognormal (mean=1.57 days, std=0.65 days)
Infectious period | 7 days (from 2 days before to 5 days after symptom onset)
Baseline transmission ...
[39] The DQN training pipeline is adapted from the CleanRL framework [Huang et al., 2022]. The DQN is trained using off-policy reinforcement learning with a replay buffer of size $2 \times 10^5$ and a batch size of
[40] We use the Adam optimizer with gradient clipping at a maximum norm of 1.0 to stabilize updates. A cosine learning-rate schedule with linear warmup over the first $10^4$ steps is applied, starting from an initial learning rate of $5 \times 10^{-5}$. Training is performed for up to $5 \times 10^6$ environment steps. To support joint adaptation with the supervised learning (SL) enc...
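A cosine schedule with linear warmup, as described in [40], can be sketched as below. Two points are assumptions because the source does not settle them: $5 \times 10^{-5}$ is treated as the peak rate reached at the end of warmup, and the rate decays to zero at $5 \times 10^6$ steps. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak_lr=5e-5, warmup_steps=10_000,
                  total_steps=5_000_000, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule_samples = [learning_rate(s) for s in (0, 9_999, 10_000, 2_505_000, 5_000_000)]
```

The warmup keeps early updates small while replay statistics are still unreliable, and the cosine tail reduces the step size smoothly instead of in discrete drops.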
[41]

Optimization uses the AdamW optimizer with a learning rate of3×10−5, weight decay10 −4, and gradient clipping with a maximum norm of 0.5

We use Gen- eralized Advantage Estimation with discount factorγ= 0.99 andλ= 0.90, and normalize advantages using a running mean and variance estimator. Optimization uses the AdamW optimizer with a learning rate of3×10−5, weight decay10 −4, and gradient clipping with a maximum norm of 0.5. The PPO clipped objective is used with a clip coefficient of 0.10, ...

[42]

As expected, the runtime of both methods increases with the number of clusters. However, Hier-PPO consistently achieves lower decision latency across all settings, with speedups ranging from approximately 4× to 8×. The difference in runtime between the two methods is most significant when budgets are tight. When the test budget is relatively small compa...


[1]

[Alamo et al., 2021] Teodoro Alamo, Daniel G Reina, Pablo Millán Gata, Victor M Preciado, and Giulia Giordano. Data-driven methods for present and future pandemics: Monitoring, modelling and managing. Annual Reviews in Control, 52:448–464, 2021.

[2]

[Amann et al., 2020] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20(1):310, 2020.

[3]

[Argyle et al., 2023] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.

[4]

[Avrachenkov and Borkar, 2022] Konstantin E Avrachenkov and Vivek S Borkar. Whittle index based Q-learning for restless bandits with average reward. Automatica, 139:110186, 2022.

[5]

[Barto and Mahadevan, 2003] Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[6]

[Bertsekas, 1997] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1997.

[7]

[Biswas et al., 2021] Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021.

[8]

[Brauer et al., 2017] Fred Brauer, Carlos Castillo-Chavez, and Zhilan Feng. Mathematical epidemiology. Springer, 2017.

[9]

[Center, 1987] Ohio Supercomputer Center. Ohio Supercomputer Center, 1987.

[10]

[Funk et al., 2010] Sebastian Funk, Marcel Salathé, and Vincent AA Jansen. Modelling the influence of human behaviour on the spread of infectious diseases: a review. Journal of the Royal Society Interface, 7(50):1247–1256, 2010.

[11]

[Glazebrook et al., 2006] Kevin D Glazebrook, Diego Ruiz-Hernandez, and Christopher Kirkbride. Some indexable families of restless bandit problems. Advances in Applied Probability, 38(3):643–672, 2006.

[12]

[Holzinger et al., 2019] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1312, 2019.

[13]

[Huang et al., 2022] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João GM Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022.

[14]

[Killian et al., 2021] Jackson A Killian, Andrew Perrault, and Milind Tambe. Beyond “to act or not to act”: Fast Lagrangian approaches to general multi-action restless bandits. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 710–718, 2021.

[15]

[Kompella et al., 2020] Varun Kompella, Roberto Capobianco, Stacy Jong, Jonathan Browne, Spencer Fox, Lauren Meyers, Peter Wurman, and Peter Stone. Reinforcement learning for optimization of COVID-19 mitigation policies. arXiv preprint arXiv:2010.10560, 2020.

[16]

[Liu and Zhao, 2010] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.

[17]

[Mao and Perrault, 2026] Yi Mao and Andrew Perrault. Optimizing urban service allocation with time-constrained restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 39025–39032, 2026.

[18]

[Mate et al., 2020] Aditya Mate, Jackson Killian, Haifeng Xu, Andrew Perrault, and Milind Tambe. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639–15650, 2020.

[19]

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[20]

[Nachum et al., 2018] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

[21]

[Nakhleh et al., 2021] Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I-Hong Hou, and Srinivas Shakkottai. NeurWIN: Neural Whittle index network for restless bandits via deep RL. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 828–839. Curran Asso...

[22]

[Neely, 2010] Michael J Neely. Stochastic network optimization with application to resource allocation and queueing. Synthesis Lectures on Communication Networks, 3:1–211, 2010.

[23]

[Nowzari et al., 2016] Cameron Nowzari, Victor M Preciado, and George J Pappas. Analysis and control of epidemics: A survey of spreading processes on complex networks. IEEE Control Systems Magazine, 36(1):26–46, 2016.

[24]

[Park et al., 2023] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

[25]

[Peng et al., 2023] Xueqiao Peng, Jiaqi Xu, Xi Chen, Dinh Song An Nguyen, and Andrew Perrault. Using reinforcement learning for multi-objective cluster-level optimization of non-pharmaceutical interventions for infectious disease. In Machine Learning for Health (ML4H), pages 445–460. PMLR, 2023.

[26]

[Probert et al., 2016] William JM Probert, Katriona Shea, Christopher J Fonnesbeck, Michael C Runge, Tim E Carpenter, Salome Dürr, M Graeme Garner, Neil Harvey, Mark A Stevenson, and Colleen T Webb. Decision-making for foot-and-mouth disease control: objectives matter. Epidemics, 15:10–19, 2016.

[27]

[Rahmandad et al., 2021] Hazhir Rahmandad, Tse Yang Lim, and John Sterman. Behavioral dynamics of COVID-19: estimating underreporting, multiple waves, and adherence fatigue across 92 nations. System Dynamics Review, 37(1):5–31, 2021.

[28]

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[29]

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[30]

[Verelst et al., 2016] Frederik Verelst, Lander Willem, and Philippe Beutels. Behavioural change models for infectious disease transmission: a systematic review (2010–2015). Journal of The Royal Society Interface, 13(125):20160820, 2016.

[31]

[Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549. PMLR, 2017.

[32]

[Weber and Weiss, 1990] Richard R Weber and Gideon Weiss. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990.

[33]

[Whittle, 1988] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.

[34]

[Yang et al., 2018] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018.

[35]

as the local decision module. In this framework, each cluster is controlled by a Deep Q-Network (DQN) that makes individual-level testing decisions under partial observability. Our implementation extends this framework to support adaptation across different testing cost regimes without retraining.

Supervised Learning Encoder. Because true infection sta...

[36]

For an active cluster, this vector encodes its size (normalized by the maximum cluster size), its age relative to the episode length, and short-term histories of testing activity, symptom prevalence, and positive test outcomes over the previous three timesteps. These quantities are normalized by cluster size to ensure comparability across clusters of ...
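The cluster-level feature vector described above could be assembled roughly as follows. This is a hedged Python sketch: the feature ordering, the raw-count inputs, and the zero-padding used when a cluster has fewer than three timesteps of history are assumptions for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence

HISTORY_LEN = 3  # "previous three timesteps" in the text


def cluster_features(
    size: int,
    max_size: int,
    age: int,
    episode_length: int,
    tests: Sequence[int],        # tests administered per past timestep
    symptomatic: Sequence[int],  # symptomatic individuals per past timestep
    positives: Sequence[int],    # positive test results per past timestep
) -> List[float]:
    """Build an 11-dimensional summary vector for one active cluster."""
    if size <= 0 or max_size <= 0 or episode_length <= 0:
        raise ValueError("size, max_size and episode_length must be positive")

    def recent_fraction(counts: Sequence[int]) -> List[float]:
        recent = list(counts)[-HISTORY_LEN:]
        # Assumed: left-pad with zeros when the cluster is younger than 3 steps.
        padded = [0] * (HISTORY_LEN - len(recent)) + recent
        # Normalize by cluster size so clusters of different sizes are comparable.
        return [count / size for count in padded]

    return (
        [size / max_size, age / episode_length]
        + recent_fraction(tests)
        + recent_fraction(symptomatic)
        + recent_fraction(positives)
    )
```

For example, a cluster of 10 people in a system whose largest cluster has 40 contributes a size feature of 0.25, and 3 tests in the latest timestep contribute a testing feature of 0.3.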


[37]

Quarantine decisions follow a threshold policy adopted from Peng [2023], while testing decisions are controlled by the learned policy
