Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach

Cheol Woo Kim; Dimitris Bertsimas; Jos\'e Ni\~no-Mora

arxiv: 2502.03725 · v2 · submitted 2025-02-06 · 💻 cs.LG

Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach

Dimitris Bertsimas , Cheol Woo Kim , Jos\'e Ni\~no-Mora This is my paper

Pith reviewed 2026-05-23 03:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords fluid restless multi-armed banditsoptimal controlmachine learningoptimal classification treesstate feedback policyepidemic controlfisheries management

0 comments

The pith

Machine learning learns time-dependent policies for fluid restless multi-armed bandits that run up to 26 million times faster than numerical methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a machine learning framework to compute optimal control policies for fluid restless multi-armed bandit problems whose dynamics are affine or quadratic. It solves many instances with varied initial states to build a training set, applies nonlinear feature transformations that exploit problem structure, and trains Optimal Classification Trees with hyperplane splits to produce a time-dependent state-feedback policy. The resulting policies match the quality of direct numerical solutions on machine-maintenance, epidemic-control, and fisheries problems while delivering large speed-ups once learned.

Core claim

Solving multiple FRMABP instances with diverse initial states generates a training set that, after nonlinear transformation of the feature vectors, lets OCT-H learn a high-quality time-dependent state feedback policy; once obtained, this policy delivers performance comparable to the direct numerical algorithm at up to 26 million times the speed on the tested applications.

What carries the argument

Optimal Classification Trees with hyperplane splits (OCT-H) that map transformed state features to control actions, trained on data generated by repeated numerical solution of the fluid model.

If this is right

The approach produces usable policies for machine maintenance, epidemic control, and fisheries management.
Once trained, the policy evaluates in microseconds rather than the time required for repeated numerical optimization.
The method applies whenever the state equations are affine or quadratic in the fluid variables.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time deployment becomes feasible in settings where the numerical solver is too slow to rerun at each decision epoch.
The same training pipeline could be reused across families of similar problems by varying only the initial-state distribution.
If the learned policy is combined with occasional re-optimization, it may reduce total computational cost while preserving near-optimality.

Load-bearing premise

Solving many instances with different starting states yields a training set from which OCT-H can learn a single policy that works well for any reachable state.

What would settle it

Take a new initial state outside the training set, apply the learned policy, and compare its cumulative reward to the value returned by the direct numerical algorithm on the same instance; a substantial gap would show the policy fails to generalize.

read the original abstract

We present a novel machine learning framework for the optimal control of fluid restless multi-armed bandit problems (FRMABPs) with state equations that are either affine or quadratic in the state variables. By establishing fundamental properties of FRMABPs, we develop an efficient numerical algorithm that generates a comprehensive training set by solving multiple instances with diverse initial states. We further enhance this training set by applying a nonlinear transformation to the feature vectors, leveraging structural properties of FRMABPs. A time-dependent state feedback policy is then learned using Optimal Classification Trees with hyperplane splits (OCT-H). We test our approach on machine maintenance, epidemic control, and fisheries control problems, demonstrating that our method yields high-quality state feedback policies. Furthermore, once a policy is learned, it achieves a speed-up of up to 26 million times compared to the direct numerical algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Trains OCT-H on numerical FRMABP solves with nonlinear features for fast policies, but training coverage of reachable states under closed-loop use is unverified.

read the letter

The paper's core move is to solve FRMABPs numerically for lots of initial conditions, augment the features with a nonlinear map, and fit an OCT-H to get a time-dependent policy that runs much faster. They claim this works well on machine maintenance, epidemic control, and fisheries, with speedups up to 26 million times. What stands out as new is the specific pipeline for these fluid models with affine or quadratic dynamics, building on earlier OCT-H work by adding the nonlinear transform and the data generation step for restless bandits. It does a reasonable job setting up the training data generation and applying it to three different control problems. The structural properties they use for the transform seem sensible given the problem class. The main soft spot is exactly the one in the stress test. Training on diverse initial states does not guarantee that the sampled points cover the states the learned policy will actually reach when run in closed loop. The reachable set can be limited, and if the tree is fit only on the training distribution, it can give poor decisions elsewhere. The abstract does not mention any checks for this, like testing on new trajectories or measuring policy error. Without those, the high-quality claim is hard to assess. The lack of any numbers on accuracy or comparisons to other methods also makes the results difficult to evaluate. This paper is for people who work on approximate dynamic programming or ML for control in bandit-like settings. A reader looking for a practical way to speed up these specific problems might find it useful if the full paper has solid validation. I would send it to peer review so the empirical details can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a machine learning framework for optimal control of fluid restless multi-armed bandit problems (FRMABPs) with affine or quadratic state equations. It establishes fundamental properties of FRMABPs, develops an efficient numerical algorithm to generate a training set by solving multiple instances with diverse initial states, applies a nonlinear transformation to feature vectors, and learns a time-dependent state feedback policy via Optimal Classification Trees with hyperplane splits (OCT-H). The approach is tested on machine maintenance, epidemic control, and fisheries control problems, claiming high-quality policies and computational speed-ups of up to 26 million times relative to direct numerical solution.

Significance. If the empirical claims hold with proper verification of policy quality and generalization, the work would provide a practical route to fast, deployable state-feedback policies for a class of restless bandit control problems arising in operations research and applied control, with clear computational advantages once the offline training is complete.

major comments (2)

[Abstract] Abstract (and corresponding method description): the procedure of generating training data from numerical solves on instances with diverse initial states is asserted to produce a 'comprehensive' set sufficient for OCT-H to learn a high-quality time-dependent policy, yet no verification is described that the sampled trajectories densely cover the reachable sets visited under the closed-loop dynamics of the learned policy. Because the state equations are affine or quadratic, reachable sets under feedback can be curved and low-dimensional subsets of the domain; without coverage checks or out-of-sample closed-loop validation, the generalization claim is load-bearing and unproven.
[Abstract] Abstract: the central empirical claims ('high-quality state feedback policies' and 'speed-up of up to 26 million times') are stated without any quantitative metrics, error bars, baseline comparisons, or description of how quality or timing was measured. These details are required to substantiate the performance assertions that justify the entire framework.

minor comments (2)

The description of the nonlinear transformation applied to feature vectors could be expanded with an explicit formula or pseudocode to allow reproduction.
Clarify whether the OCT-H is trained once per problem class or per instance, and how time-dependence is encoded in the classifier input.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our framework. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (and corresponding method description): the procedure of generating training data from numerical solves on instances with diverse initial states is asserted to produce a 'comprehensive' set sufficient for OCT-H to learn a high-quality time-dependent policy, yet no verification is described that the sampled trajectories densely cover the reachable sets visited under the closed-loop dynamics of the learned policy. Because the state equations are affine or quadratic, reachable sets under feedback can be curved and low-dimensional subsets of the domain; without coverage checks or out-of-sample closed-loop validation, the generalization claim is load-bearing and unproven.

Authors: We agree that explicit verification strengthens the generalization claim. The manuscript uses fundamental properties of FRMABPs (established in Section 2) to select diverse initial states that target the relevant state space, and the nonlinear feature transforms further exploit problem structure. However, we acknowledge that the current text does not include dedicated coverage checks or closed-loop out-of-sample validation. In the revision we will add a new subsection with quantitative coverage analysis (e.g., convex-hull containment of simulated closed-loop trajectories) and additional out-of-sample tests on unseen initial conditions. revision: yes
Referee: [Abstract] Abstract: the central empirical claims ('high-quality state feedback policies' and 'speed-up of up to 26 million times') are stated without any quantitative metrics, error bars, baseline comparisons, or description of how quality or timing was measured. These details are required to substantiate the performance assertions that justify the entire framework.

Authors: The abstract is length-constrained, but Sections 4–5 of the manuscript report quantitative policy-quality metrics (optimality gaps relative to the numerical solver), baseline comparisons, timing measurements on standardized hardware, and the precise conditions yielding the reported speed-ups. To address the concern directly in the abstract, we will revise it to include concise quantitative highlights (e.g., average gaps and measurement protocol) while preserving readability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper generates a training set by independently solving multiple FRMABP instances numerically for diverse initial states, augments features via a nonlinear map derived from structural properties, and fits an OCT-H classifier to obtain an approximate time-dependent policy. This constitutes standard supervised learning from external numerical data rather than any reduction of the output policy to fitted inputs, self-definitions, or self-citation chains. The central claim (high-quality learned policies with massive speed-up) rests on the quality of the independent solves and the generalization of OCT-H, with no load-bearing step that equates the learned policy to its own training data by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard properties of FRMABPs and OCT-H that are not detailed here.

pith-pipeline@v0.9.0 · 5684 in / 1037 out tokens · 25423 ms · 2026-05-23T03:26:13.341888+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a machine learning approach to solve FRMAB problems using OCT-H... augmenting the feature vector with nonlinear transformations... terms 1/xi(t) + α0,i/β0,i ...
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Pontryagin’s Maximum Principle... Hamiltonian... index policy determined by ranking γi*(t)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

, Makis , V

barticle Abbou , A. , Makis , V. : Group maintenance: a restless bandits approach . INFORMS J. Comput. 31 , 719 -- 731 ( 2019 ) barticle

work page 2019
[2]

: Verhulst and the logistic equation (1838) , pp

bbook Baca \"e r , N. : Verhulst and the logistic equation (1838) , pp. 35 -- 39 . Springer , London ( 2011 ). 10.1007/978-0-85729-115-8_6 . https://doi.org/10.1007/978-0-85729-115-8_6 bbook

work page doi:10.1007/978-0-85729-115-8_6 2011
[3]

, Dunn , J

barticle Bertsimas , D. , Dunn , J. : Optimal classification trees . Machine Learning 106 ( 1 ), 1039 -- 1082 ( 2017 ) barticle

work page 2017
[4]

, Dunn , J

bbook Bertsimas , D. , Dunn , J. : Machine Learning Under a Modern Optimization Lens . Dynamic Ideas , Charlestown, MA, USA ( 2019 ) bbook

work page 2019
[5]

, Friedman , J

bbook Breiman , L. , Friedman , J. , J. Stone , C. , Olshen , R.A. : Classification and Regression Trees . Chapman and Hall/CRC , New York, NY, USA ( 1984 ) bbook

work page 1984
[6]

, Gamarnik , D

bbook Bertsimas , D. , Gamarnik , D. : Queueing Theory: Classical and Modern Methods . Dynamic Ideas , Charlestown, MA, USA ( 2022 ) bbook

work page 2022
[7]

Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach

botherref Bertsimas , D. , Kim , C.W. : Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach (2023). https://arxiv.org/abs/2307.12405 botherref

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

, Nasrabadi , E

barticle Bertsimas , D. , Nasrabadi , E. , Paschalidis , I.C. : Robust fluid processing networks . IEEE Transactions on Automatic Control 60 ( 3 ), 715 -- 728 ( 2015 ) barticle

work page 2015
[9]

: A class of methods for solving nonlinear simultaneous equations

barticle Broyden , C.G. : A class of methods for solving nonlinear simultaneous equations . Math. Comp. 19 , 577 -- 593 ( 1965 ) barticle

work page 1965
[10]

: On positive harris recurrence of multiclass queueing networks: A unified approach via fluid limit models

barticle Dai , J. : On positive harris recurrence of multiclass queueing networks: A unified approach via fluid limit models . The Annals of Applied Probability 5 ( 1 ), 49 -- 77 ( 1995 ) barticle

work page 1995
[11]

, Lory , P

barticle Diekhoff , H.J. , Lory , P. , Oberle , H.J. , Pesch , H.J. , Rentrop , P. , Seydel , R. : Comparing routines for the numerical solution of initial value problems of ordinary differential equations in multiple shooting . Numerische Mathematik 27 ( 4 ), 449 -- 469 ( 1976 ) barticle

work page 1976
[12]

, Nazarathy , Y

bchapter Fu , J. , Nazarathy , Y. , Moka , S. , Taylor , P.G. : Towards q-learning the whittle index for restless bandits . In: 2019 Australian & New Zealand Control Conference (ANZCC) , pp. 249 -- 254 ( 2019 ). 10.1109/ANZCC47194.2019.8945748 bchapter

work page doi:10.1109/anzcc47194.2019.8945748 2019
[13]

: Some convergence properties of B royden's method

barticle Gay , D.M. : Some convergence properties of B royden's method . SIAM J. Numer. Analysis 16 , 623 -- 630 ( 1979 ) barticle

work page 1979
[14]

, Caulkins , J.P

bbook Grass , D. , Caulkins , J.P. , Feichtinger , G. , Tragler , G. , Behrens , .D.A. (eds.): Optimal Control of Nonlinear Processes: With Applications in Drugs, Corruption, and Terror . Springer , Berlin ( 2008 ) bbook

work page 2008
[15]

: Interpretable AI Documentation (2023)

botherref Interpretable AI , L. : Interpretable AI Documentation (2023). https://www.interpretable.ai botherref

work page 2023
[16]

, Hutchinson , D

barticle Khalaf , B.M.S. , Hutchinson , D. : Parallel algorithms for initial value problems: Parallel shooting . Parallel computing 18 ( 6 ), 661 -- 673 ( 1992 ) barticle

work page 1992
[17]

, Schwartz , N.L

barticle Kamien , M.I. , Schwartz , N.L. : Optimal maintenance and sale age for a machine subject to failure . Management Science 17 ( 8 ), 495 -- 504 ( 1971 ) barticle

work page 1971
[18]

, Yang , I

bchapter Kim , J. , Yang , I. : Hamilton-jacobi-bellman equations for q-learning in continuous time . In: Bayen , A.M. , Jadbabaie , A. , Pappas , G. , Parrilo , P.A. , Recht , B. , Tomlin , C. , Zeilinger , M. (eds.) Proceedings of the 2nd Conference on Learning for Dynamics and Control . Proceedings of Machine Learning Research , vol. 120 , pp. 739 -- 7...

work page 2020
[19]

, Belousov , B

barticle Lutter , M. , Belousov , B. , Mannor , S. , Fox , D. , Garg , A. , Peters , J. : Continuous-time fitted value iteration for robust policies . IEEE Transactions on Pattern Analysis and Machine Intelligence 45 ( 5 ), 5534 -- 5548 ( 2023 ) 10.1109/TPAMI.2022.3215769 barticle

work page doi:10.1109/tpami.2022.3215769 2023
[20]

, Ayesta , U

barticle Larra\ n aga , M. , Ayesta , U. , Verloop , I.M. : Dynamic control of birth-and-death restless bandits: Application to resource-allocation problems . IEEE/ACM Transactions on Networking 24 ( 6 ), 3812 -- 3825 ( 2016 ) 10.1109/TNET.2016.2562564 barticle

work page doi:10.1109/tnet.2016.2562564 2016
[21]

, Sutton , R.S

barticle Lee , J. , Sutton , R.S. : Policy iterations for reinforcement learning problems in continuous time and space --- fundamental theory and methods . Automatica 126 , 109421 ( 2021 ) barticle

work page 2021
[22]

: Dynamic scheduling in multiclass queueing networks: Stability under discrete-review policies

barticle Maglaras , C. : Dynamic scheduling in multiclass queueing networks: Stability under discrete-review policies . Queueing Systems 31 ( 3 ), 171 -- 206 ( 1999 ) barticle

work page 1999
[23]

: Discrete-review policies for scheduling stochastic networks: trajectory tracking and fluid-scale asymptotic optimality

barticle Maglaras , C. : Discrete-review policies for scheduling stochastic networks: trajectory tracking and fluid-scale asymptotic optimality . The Annals of Applied Probability 10 ( 3 ), 897 -- 929 ( 2000 ) barticle

work page 2000
[24]

: Control Techniques for Complex Networks , 1st edn

bbook Meyn , S. : Control Techniques for Complex Networks , 1st edn. Cambridge University Press , USA ( 2007 ) bbook

work page 2007
[25]

, Manjunath , D

barticle Meshram , R. , Manjunath , D. , Gopalan , A. : On the whittle index for restless multiarmed hidden markov bandits . IEEE Transactions on Automatic Control 63 ( 9 ), 3046 -- 3053 ( 2018 ) 10.1109/TAC.2018.2799521 barticle

work page doi:10.1109/tac.2018.2799521 2018
[26]

, Mary , P

barticle Modi , N. , Mary , P. , Moy , C. : Transfer restless multi-armed bandit policy for energy-efficient heterogeneous cellular network . EURASIP Journal on Advances in Signal Processing 2019 ( 1 ), 46 ( 2019 ) barticle

work page 2019
[27]

, Madaan , L

barticle Mate , A. , Madaan , L. , Taneja , A. , Madhiwalla , N. , Verma , S. , Singh , G. , Hegde , A. , Varakantham , P. , Tambe , M. : Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health . Proceedings of the AAAI Conference on Artificial Intelligence 36 , 12017 -- 12025 ( 2022 ) 10.1609/aa...

work page doi:10.1609/aaai.v36i11.21460 2022
[28]

, Ganji , S

bchapter Nakhleh , K. , Ganji , S. , Hsieh , P.-C. , Hou , I.-H. , Shakkottai , S. : Neurwin: neural whittle index network for restless bandits via deep rl . In: Proceedings of the 35th International Conference on Neural Information Processing Systems . NIPS '21 . Curran Associates Inc. , Red Hook, NY, USA ( 2024 ) bchapter

work page 2024
[29]

: Markovian restless bandits and index policies: A review

barticle Ni\ n o - Mora , J. : Markovian restless bandits and index policies: A review . Mathematics 11 , 1639 ( 2023 ) barticle

work page 2023
[30]

, Boltyanskii , V.G

barticle Pontryagin , L.S. , Boltyanskii , V.G. , Gamkrelidze , R.V. , Mishechenko , E.F. : The mathematical theory of optimal processes. viii + 360 s. new york/london 1962. john wiley & sons. preis 90/– . Zamm-zeitschrift Fur Angewandte Mathematik Und Mechanik 43 , 514 -- 515 ( 1963 ) barticle

work page 1962
[31]

: An sis epidemic model with individual variation

barticle Pollett , P.K. : An sis epidemic model with individual variation . Mathematical Biosciences and Engineering 21 ( 4 ), 5446 -- 5455 ( 2024 ) 10.3934/mbe.2024240 barticle

work page doi:10.3934/mbe.2024240 2024
[32]

, Tsitsiklis , J.N

barticle Papadimitriou , C.H. , Tsitsiklis , J.N. : The complexity of optimal queuing network control . Mathematics of Operations Research 24 ( 2 ), 293 -- 305 ( 1999 ). Accessed 2024-08-23 barticle

work page 1999
[33]

, Bulirsch , R

bbook Stoer , J. , Bulirsch , R. : Introduction to Numerical Analysis , 3rd ed. edn. Texts in applied mathematics , vol. 12 . Springer , New York, NY ( 2013 ) bbook

work page 2013
[34]

: Some considerations of population dynamics and economics in relation to the management of the commercial marine fisheries

barticle Schaefer , M.B. : Some considerations of population dynamics and economics in relation to the management of the commercial marine fisheries . Journal of the Fisheries Research Board of Canada 14 ( 5 ), 669 -- 681 ( 1957 ) barticle

work page 1957
[35]

: The economic theory of a common-property resource: The fishery

barticle Scott Gordon , H. : The economic theory of a common-property resource: The fishery . Bulletin of Mathematical Biology 53 ( 1 ), 231 -- 252 ( 1991 ) barticle

work page 1991
[36]

: Optimal Control Theory

bbook Sethi , S.P. : Optimal Control Theory . Springer , Gewerbestrasse 11, 6330 Cham, Switzerland ( 2019 ) bbook

work page 2019
[37]

: On the stability of multiclass queueing networks: A relaxed sufficient condition via limiting fluid processes

barticle Stolyar , A.L. : On the stability of multiclass queueing networks: A relaxed sufficient condition via limiting fluid processes . Markov Processes And Related Fields 1 , 491 -- 512 ( 1995 ) barticle

work page 1995
[38]

, Modiano , E

bchapter Tripathi , V. , Modiano , E. : A whittle index approach to minimizing functions of age of information . In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pp. 1160 -- 1167 ( 2019 ). 10.1109/ALLERTON.2019.8919842 bchapter

work page doi:10.1109/allerton.2019.8919842 2019
[39]

: Restless bandits: Activity allocation in a changing world

barticle Whittle , P. : Restless bandits: Activity allocation in a changing world . J. Appl. Probab. 25A , 287 -- 298 ( 1988 ) barticle

work page 1988
[40]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page

[1] [1]

, Makis , V

barticle Abbou , A. , Makis , V. : Group maintenance: a restless bandits approach . INFORMS J. Comput. 31 , 719 -- 731 ( 2019 ) barticle

work page 2019

[2] [2]

: Verhulst and the logistic equation (1838) , pp

bbook Baca \"e r , N. : Verhulst and the logistic equation (1838) , pp. 35 -- 39 . Springer , London ( 2011 ). 10.1007/978-0-85729-115-8_6 . https://doi.org/10.1007/978-0-85729-115-8_6 bbook

work page doi:10.1007/978-0-85729-115-8_6 2011

[3] [3]

, Dunn , J

barticle Bertsimas , D. , Dunn , J. : Optimal classification trees . Machine Learning 106 ( 1 ), 1039 -- 1082 ( 2017 ) barticle

work page 2017

[4] [4]

, Dunn , J

bbook Bertsimas , D. , Dunn , J. : Machine Learning Under a Modern Optimization Lens . Dynamic Ideas , Charlestown, MA, USA ( 2019 ) bbook

work page 2019

[5] [5]

, Friedman , J

bbook Breiman , L. , Friedman , J. , J. Stone , C. , Olshen , R.A. : Classification and Regression Trees . Chapman and Hall/CRC , New York, NY, USA ( 1984 ) bbook

work page 1984

[6] [6]

, Gamarnik , D

bbook Bertsimas , D. , Gamarnik , D. : Queueing Theory: Classical and Modern Methods . Dynamic Ideas , Charlestown, MA, USA ( 2022 ) bbook

work page 2022

[7] [7]

Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach

botherref Bertsimas , D. , Kim , C.W. : Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach (2023). https://arxiv.org/abs/2307.12405 botherref

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

, Nasrabadi , E

barticle Bertsimas , D. , Nasrabadi , E. , Paschalidis , I.C. : Robust fluid processing networks . IEEE Transactions on Automatic Control 60 ( 3 ), 715 -- 728 ( 2015 ) barticle

work page 2015

[9] [9]

: A class of methods for solving nonlinear simultaneous equations

barticle Broyden , C.G. : A class of methods for solving nonlinear simultaneous equations . Math. Comp. 19 , 577 -- 593 ( 1965 ) barticle

work page 1965

[10] [10]

: On positive harris recurrence of multiclass queueing networks: A unified approach via fluid limit models

barticle Dai , J. : On positive harris recurrence of multiclass queueing networks: A unified approach via fluid limit models . The Annals of Applied Probability 5 ( 1 ), 49 -- 77 ( 1995 ) barticle

work page 1995

[11] [11]

, Lory , P

barticle Diekhoff , H.J. , Lory , P. , Oberle , H.J. , Pesch , H.J. , Rentrop , P. , Seydel , R. : Comparing routines for the numerical solution of initial value problems of ordinary differential equations in multiple shooting . Numerische Mathematik 27 ( 4 ), 449 -- 469 ( 1976 ) barticle

work page 1976

[12] [12]

, Nazarathy , Y

bchapter Fu , J. , Nazarathy , Y. , Moka , S. , Taylor , P.G. : Towards q-learning the whittle index for restless bandits . In: 2019 Australian & New Zealand Control Conference (ANZCC) , pp. 249 -- 254 ( 2019 ). 10.1109/ANZCC47194.2019.8945748 bchapter

work page doi:10.1109/anzcc47194.2019.8945748 2019

[13] [13]

: Some convergence properties of B royden's method

barticle Gay , D.M. : Some convergence properties of B royden's method . SIAM J. Numer. Analysis 16 , 623 -- 630 ( 1979 ) barticle

work page 1979

[14] [14]

, Caulkins , J.P

bbook Grass , D. , Caulkins , J.P. , Feichtinger , G. , Tragler , G. , Behrens , .D.A. (eds.): Optimal Control of Nonlinear Processes: With Applications in Drugs, Corruption, and Terror . Springer , Berlin ( 2008 ) bbook

work page 2008

[15] [15]

: Interpretable AI Documentation (2023)

botherref Interpretable AI , L. : Interpretable AI Documentation (2023). https://www.interpretable.ai botherref

work page 2023

[16] [16]

, Hutchinson , D

barticle Khalaf , B.M.S. , Hutchinson , D. : Parallel algorithms for initial value problems: Parallel shooting . Parallel computing 18 ( 6 ), 661 -- 673 ( 1992 ) barticle

work page 1992

[17] [17]

, Schwartz , N.L

barticle Kamien , M.I. , Schwartz , N.L. : Optimal maintenance and sale age for a machine subject to failure . Management Science 17 ( 8 ), 495 -- 504 ( 1971 ) barticle

work page 1971

[18] [18]

, Yang , I

bchapter Kim , J. , Yang , I. : Hamilton-jacobi-bellman equations for q-learning in continuous time . In: Bayen , A.M. , Jadbabaie , A. , Pappas , G. , Parrilo , P.A. , Recht , B. , Tomlin , C. , Zeilinger , M. (eds.) Proceedings of the 2nd Conference on Learning for Dynamics and Control . Proceedings of Machine Learning Research , vol. 120 , pp. 739 -- 7...

work page 2020

[19] [19]

, Belousov , B

barticle Lutter , M. , Belousov , B. , Mannor , S. , Fox , D. , Garg , A. , Peters , J. : Continuous-time fitted value iteration for robust policies . IEEE Transactions on Pattern Analysis and Machine Intelligence 45 ( 5 ), 5534 -- 5548 ( 2023 ) 10.1109/TPAMI.2022.3215769 barticle

work page doi:10.1109/tpami.2022.3215769 2023

[20] [20]

, Ayesta , U

barticle Larra\ n aga , M. , Ayesta , U. , Verloop , I.M. : Dynamic control of birth-and-death restless bandits: Application to resource-allocation problems . IEEE/ACM Transactions on Networking 24 ( 6 ), 3812 -- 3825 ( 2016 ) 10.1109/TNET.2016.2562564 barticle

work page doi:10.1109/tnet.2016.2562564 2016

[21] [21]

, Sutton , R.S

barticle Lee , J. , Sutton , R.S. : Policy iterations for reinforcement learning problems in continuous time and space --- fundamental theory and methods . Automatica 126 , 109421 ( 2021 ) barticle

work page 2021

[22] [22]

: Dynamic scheduling in multiclass queueing networks: Stability under discrete-review policies

barticle Maglaras , C. : Dynamic scheduling in multiclass queueing networks: Stability under discrete-review policies . Queueing Systems 31 ( 3 ), 171 -- 206 ( 1999 ) barticle

work page 1999

[23] [23]

: Discrete-review policies for scheduling stochastic networks: trajectory tracking and fluid-scale asymptotic optimality

barticle Maglaras , C. : Discrete-review policies for scheduling stochastic networks: trajectory tracking and fluid-scale asymptotic optimality . The Annals of Applied Probability 10 ( 3 ), 897 -- 929 ( 2000 ) barticle

work page 2000

[24] [24]

: Control Techniques for Complex Networks , 1st edn

bbook Meyn , S. : Control Techniques for Complex Networks , 1st edn. Cambridge University Press , USA ( 2007 ) bbook

work page 2007

[25] [25]

, Manjunath , D

barticle Meshram , R. , Manjunath , D. , Gopalan , A. : On the whittle index for restless multiarmed hidden markov bandits . IEEE Transactions on Automatic Control 63 ( 9 ), 3046 -- 3053 ( 2018 ) 10.1109/TAC.2018.2799521 barticle

work page doi:10.1109/tac.2018.2799521 2018

[26] [26]

, Mary , P

barticle Modi , N. , Mary , P. , Moy , C. : Transfer restless multi-armed bandit policy for energy-efficient heterogeneous cellular network . EURASIP Journal on Advances in Signal Processing 2019 ( 1 ), 46 ( 2019 ) barticle

work page 2019

[27] [27]

, Madaan , L

barticle Mate , A. , Madaan , L. , Taneja , A. , Madhiwalla , N. , Verma , S. , Singh , G. , Hegde , A. , Varakantham , P. , Tambe , M. : Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health . Proceedings of the AAAI Conference on Artificial Intelligence 36 , 12017 -- 12025 ( 2022 ) 10.1609/aa...

work page doi:10.1609/aaai.v36i11.21460 2022

[28] [28]

, Ganji , S

bchapter Nakhleh , K. , Ganji , S. , Hsieh , P.-C. , Hou , I.-H. , Shakkottai , S. : Neurwin: neural whittle index network for restless bandits via deep rl . In: Proceedings of the 35th International Conference on Neural Information Processing Systems . NIPS '21 . Curran Associates Inc. , Red Hook, NY, USA ( 2024 ) bchapter

work page 2024

[29] [29]

: Markovian restless bandits and index policies: A review

barticle Ni\ n o - Mora , J. : Markovian restless bandits and index policies: A review . Mathematics 11 , 1639 ( 2023 ) barticle

work page 2023

[30] [30]

, Boltyanskii , V.G

barticle Pontryagin , L.S. , Boltyanskii , V.G. , Gamkrelidze , R.V. , Mishechenko , E.F. : The mathematical theory of optimal processes. viii + 360 s. new york/london 1962. john wiley & sons. preis 90/– . Zamm-zeitschrift Fur Angewandte Mathematik Und Mechanik 43 , 514 -- 515 ( 1963 ) barticle

work page 1962

[31] [31]

: An sis epidemic model with individual variation

barticle Pollett , P.K. : An sis epidemic model with individual variation . Mathematical Biosciences and Engineering 21 ( 4 ), 5446 -- 5455 ( 2024 ) 10.3934/mbe.2024240 barticle

work page doi:10.3934/mbe.2024240 2024

[32] [32]

, Tsitsiklis , J.N

barticle Papadimitriou , C.H. , Tsitsiklis , J.N. : The complexity of optimal queuing network control . Mathematics of Operations Research 24 ( 2 ), 293 -- 305 ( 1999 ). Accessed 2024-08-23 barticle

work page 1999

[33] [33]

, Bulirsch , R

bbook Stoer , J. , Bulirsch , R. : Introduction to Numerical Analysis , 3rd ed. edn. Texts in applied mathematics , vol. 12 . Springer , New York, NY ( 2013 ) bbook

work page 2013

[34] [34]

: Some considerations of population dynamics and economics in relation to the management of the commercial marine fisheries

barticle Schaefer , M.B. : Some considerations of population dynamics and economics in relation to the management of the commercial marine fisheries . Journal of the Fisheries Research Board of Canada 14 ( 5 ), 669 -- 681 ( 1957 ) barticle

work page 1957

[35] [35]

: The economic theory of a common-property resource: The fishery

barticle Scott Gordon , H. : The economic theory of a common-property resource: The fishery . Bulletin of Mathematical Biology 53 ( 1 ), 231 -- 252 ( 1991 ) barticle

work page 1991

[36] [36]

: Optimal Control Theory

bbook Sethi , S.P. : Optimal Control Theory . Springer , Gewerbestrasse 11, 6330 Cham, Switzerland ( 2019 ) bbook

work page 2019

[37] [37]

: On the stability of multiclass queueing networks: A relaxed sufficient condition via limiting fluid processes

barticle Stolyar , A.L. : On the stability of multiclass queueing networks: A relaxed sufficient condition via limiting fluid processes . Markov Processes And Related Fields 1 , 491 -- 512 ( 1995 ) barticle

work page 1995

[38] [38]

, Modiano , E

bchapter Tripathi , V. , Modiano , E. : A whittle index approach to minimizing functions of age of information . In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pp. 1160 -- 1167 ( 2019 ). 10.1109/ALLERTON.2019.8919842 bchapter

work page doi:10.1109/allerton.2019.8919842 2019

[39] [39]

: Restless bandits: Activity allocation in a changing world

barticle Whittle , P. : Restless bandits: Activity allocation in a changing world . J. Appl. Probab. 25A , 287 -- 298 ( 1988 ) barticle

work page 1988

[40] [40]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page