Inpatient Overflow Management with Proximal Policy Optimization
Pith reviewed 2026-05-23 19:13 UTC · model grok-4.3
The pith
Proximal policy optimization with atomic actions manages inpatient overflow decisions at the scale of twenty patient classes and twenty wards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop a scalable PPO framework for time-periodic inpatient overflow management that introduces atomic actions to decompose multi-patient routing into sequential assignments, employs a partially-shared policy network to balance parameter sharing with time-specific adaptations, and uses a queueing-informed value function approximation to improve evaluation. In systems with up to twenty patient classes and twenty wards, the resulting policies match or outperform benchmarks while approximate dynamic programming becomes computationally infeasible beyond five wards, and the method reduces the volume of simulation data required.
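For orientation, the generic PPO clipped surrogate can be sketched as below. This is the standard reward-maximizing form, not necessarily the paper's long-run average-cost variant; in a cost-minimization setting the advantage sign would be flipped. The function names and the default `eps=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective (to be maximized).

    ratio     -- new-policy probability / old-policy probability for the sampled action
    advantage -- advantage estimate for that action (assumed reward-based here)
    eps       -- clipping width; 0.2 is a common default, not a value from the paper
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum keeps the pessimistic (lower) of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_clipped_surrogate(ratios, advantages, eps=0.2):
    """Average the per-sample objective over a batch of samples."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

The clip removes any incentive to move the probability ratio outside the interval [1 - eps, 1 + eps] in the direction that would improve the objective, which is what keeps each policy update "proximal".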
What carries the argument
Atomic actions that break multi-patient routing into sequential single-patient assignments, inside a PPO loop augmented by a partially-shared policy network and queueing-informed value approximation.
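A minimal sketch of the decomposition idea, under stated assumptions: the paper's atomic policy is a neural network, whereas `primary_then_any` below is a hypothetical stand-in rule, and the dictionary-based state (`waiting`, `free_beds`) is an illustrative simplification rather than the paper's state representation.

```python
def route_atomically(waiting, free_beds, choose):
    """Route waiting patients one at a time instead of choosing a joint assignment.

    waiting   -- dict: patient class -> number of waiting patients
    free_beds -- dict: ward -> number of free beds
    choose    -- callable(patient_class, free) -> ward name, or None to keep waiting
    """
    free = dict(free_beds)  # copy so the caller's state is not mutated
    assignments = []
    for patient_class, count in waiting.items():
        for _ in range(count):
            # Each atomic decision sees the bed counts left by earlier decisions.
            ward = choose(patient_class, free)
            if ward is not None and free.get(ward, 0) > 0:
                free[ward] -= 1
                assignments.append((patient_class, ward))
            else:
                assignments.append((patient_class, None))
    return assignments, free


def primary_then_any(patient_class, free):
    """Hypothetical rule: use the primary ward (same name as the class) if it has a bed."""
    if free.get(patient_class, 0) > 0:
        return patient_class
    for ward, beds in free.items():
        if beds > 0:
            return ward
    return None


assignments, remaining = route_atomically(
    {"A": 2, "B": 1}, {"A": 1, "B": 1}, primary_then_any
)
```

The point of the sketch is that the policy only ever outputs a choice among wards (plus "wait") for a single patient, so the output dimension grows with the number of wards rather than with the number of joint assignments.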
If this is right
- Overflow decisions become feasible for hospital systems an order of magnitude larger than those handled by dynamic programming methods.
- The volume of simulation runs needed to train effective policies drops substantially compared with standard reinforcement learning.
- Domain-specific adaptations such as atomic actions and queueing value estimates matter more for performance than further neural-network tuning.
- The resulting policies remain explainable enough for managerial review while operating in long-run average-cost settings with periodic demand.
- The same decomposition and approximation pattern can be reused for other periodic resource-allocation problems that share the same combinatorial structure.
Where Pith is reading between the lines
- The atomic-action decomposition may simplify reinforcement learning for other matching problems that currently suffer from exponential action spaces.
- If the queueing-informed approximation generalizes, similar value-function shortcuts could accelerate learning in other queueing-control domains.
- Real-time hospital data streams could be used to test whether the long-run average-cost objective aligns with short-term operational targets.
- The partially-shared network design offers a template for balancing global and time-local policies in other periodic scheduling tasks.
Load-bearing premise
The queueing-informed value function approximation and partially-shared policy network together preserve near-optimal performance on the original multi-patient problem without introducing bias that grows with system size.
What would settle it
Apply the trained policy to a simulated system with twenty-five wards and twenty-five patient classes and measure whether the achieved average cost remains within a small percentage of the best available benchmark or lower bound.
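One way to make "within a small percentage" measurable is to report the relative gap to the benchmark across independent simulation replications, with an interval around the mean. The sketch below assumes a normal approximation (`z=1.96`), which is rough for a small number of replications; the function name and inputs are hypothetical and no values from the paper are used.

```python
from math import sqrt
from statistics import mean, stdev


def relative_gap_summary(policy_costs, benchmark_cost, z=1.96):
    """Mean relative gap to a benchmark and an approximate confidence interval.

    policy_costs   -- average cost from each independent replication (at least two)
    benchmark_cost -- average cost of the benchmark policy or lower bound (positive)
    z              -- normal quantile; 1.96 gives an approximate 95% interval
    """
    gaps = [(cost - benchmark_cost) / benchmark_cost for cost in policy_costs]
    center = mean(gaps)
    half_width = z * stdev(gaps) / sqrt(len(gaps))
    return center, (center - half_width, center + half_width)
```

A negative mean gap would indicate that the learned policy is cheaper than the benchmark on average; an interval that straddles zero would mean the replications cannot distinguish the two.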
Original abstract
Problem Definition: Managing inpatient flow in large hospital systems is challenging due to the complexity of assigning randomly arriving patients -- either waiting for primary units or being overflowed to alternative units. Current practices rely on ad-hoc rules, while prior analytical approaches struggle with the intractably large state and action spaces inherent in patient-unit matching. Scalable decision support is needed to optimize overflow management while accounting for time-periodic fluctuations in patient flow. Methodology/Results: We develop a scalable decision-making framework using Proximal Policy Optimization (PPO) to optimize overflow decisions in a time-periodic, long-run average cost setting. To address the combinatorial complexity, we introduce atomic actions, which decompose multi-patient routing into sequential assignments. We further enhance computational efficiency through a partially-shared policy network designed to balance parameter sharing with time-specific policy adaptations, and a queueing-informed value function approximation to improve policy evaluation. Our method significantly reduces the need for extensive simulation data, a common limitation in reinforcement learning applications. Case studies on hospital systems with up to twenty patient classes and twenty wards demonstrate that our approach matches or outperforms existing benchmarks, including approximate dynamic programming, which is computationally infeasible beyond five wards. Managerial Implications: Our framework offers a scalable, efficient, and explainable solution for managing patient flow in complex hospital systems. More broadly, our results highlight that domain-aware adaptation is more critical to improving algorithm performance than fine-tuning neural network parameters when applying general-purpose algorithms to specific applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a Proximal Policy Optimization (PPO) framework for inpatient overflow management in time-periodic hospital systems. It introduces atomic actions to decompose combinatorial patient-unit assignments, a partially-shared policy network for time-specific adaptations, and a queueing-informed value function approximation. Case studies claim that the method scales to 20 patient classes and 20 wards while matching or outperforming benchmarks including approximate dynamic programming (infeasible beyond 5 wards).
Significance. If the empirical claims hold with detailed validation, the work demonstrates a scalable RL approach for a high-dimensional combinatorial problem in healthcare operations by embedding queueing structure and domain knowledge, rather than relying solely on generic neural network tuning. This could inform practical decision support tools where ADP is intractable.
major comments (2)
- [Abstract, §5] Abstract and §5 (case studies): the central claim that the approach 'matches or outperforms' ADP and other benchmarks for systems up to 20 wards is presented without any numerical performance metrics, cost values, overflow rates, error bars, statistical tests, or description of how simulation data were generated and validated. This absence makes it impossible to assess whether the atomic actions and queueing-informed VFA preserve performance without bias that grows with system size.
- [§3.2] §3.2 (atomic actions): the assertion that atomic actions preserve optimality of the original multi-patient assignment problem is stated as an axiom but lacks a formal proof or counter-example analysis showing that sequential decomposition does not alter the long-run average cost for the time-periodic MDP.
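For reference, the quantity at stake in the second major comment can be written in standard textbook notation. The symbols below ($\eta$, $c$, $s_t$, $a_t$, and the "joint"/"atomic" labels) are assumptions for illustration and may not match the paper's notation.

```latex
% Long-run average cost of a policy \pi, with per-epoch cost c(s_t, a_t)
\eta^{\pi} \;=\; \limsup_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{T-1} c(s_t, a_t) \right],
\qquad
\eta^{*} \;=\; \inf_{\pi} \eta^{\pi}.

% The property the referee asks the authors to prove or qualify
\eta^{*}_{\text{atomic}} \;=\; \eta^{*}_{\text{joint}}.
```

Here $\eta^{*}_{\text{joint}}$ denotes the optimal average cost when a full multi-patient assignment is chosen at once, and $\eta^{*}_{\text{atomic}}$ the optimal average cost when the same assignment is built from sequential single-patient decisions.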
minor comments (2)
- [§4] Notation for the partially-shared policy network parameters is introduced without a clear diagram or pseudocode showing which layers are shared versus time-period specific.
- [§6] The managerial implications section repeats the abstract's claim about domain-aware adaptation without referencing specific ablation results that isolate the contribution of the queueing-informed VFA versus the policy network.
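To illustrate the kind of pseudocode the first minor comment asks for, here is one possible layout: a hidden layer shared across all time periods and a separate output head per period. Which layers the paper actually shares is not stated in this review, so the split below, the class name, and the layer sizes are all assumptions.

```python
import math
import random


class PartiallySharedPolicy:
    """Toy policy: one shared hidden layer, one output head per time period."""

    def __init__(self, n_features, n_hidden, n_actions, n_periods, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Shared parameters: used for every time period.
        self.shared = matrix(n_hidden, n_features)
        # Period-specific parameters: one output head per time period.
        self.heads = [matrix(n_actions, n_hidden) for _ in range(n_periods)]

    def action_probs(self, features, period):
        # Shared hidden layer with ReLU activation.
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, features)))
            for row in self.shared
        ]
        # Period-specific head, then a numerically stable softmax.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.heads[period]]
        top = max(logits)
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        return [e / total for e in exps]


policy = PartiallySharedPolicy(n_features=4, n_hidden=8, n_actions=3, n_periods=2)
probs_morning = policy.action_probs([1.0, 0.0, 2.0, 1.0], period=0)
probs_evening = policy.action_probs([1.0, 0.0, 2.0, 1.0], period=1)
```

With this layout the same state can map to different action probabilities in different periods, while most parameters are still learned from data pooled across periods.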
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our presentation. We respond to each major comment below, indicating the revisions we plan to make.
- Referee: [Abstract, §5] Abstract and §5 (case studies): the central claim that the approach 'matches or outperforms' ADP and other benchmarks for systems up to 20 wards is presented without any numerical performance metrics, cost values, overflow rates, error bars, statistical tests, or description of how simulation data were generated and validated. This absence makes it impossible to assess whether the atomic actions and queueing-informed VFA preserve performance without bias that grows with system size.
  Authors: We agree with this observation. The full case studies in §5 contain detailed simulation results, but these were not sufficiently summarized in the abstract or highlighted with specific metrics. In the revised version, we will update the abstract to include key numerical findings such as average costs and overflow rates for the 20-ward systems, along with comparisons to benchmarks. Additionally, we will expand §5 to include tables with performance metrics, error bars from multiple simulation runs, statistical significance tests, and a detailed description of the simulation setup and validation procedures.
  revision: yes
- Referee: [§3.2] §3.2 (atomic actions): the assertion that atomic actions preserve optimality of the original multi-patient assignment problem is stated as an axiom but lacks a formal proof or counter-example analysis showing that sequential decomposition does not alter the long-run average cost for the time-periodic MDP.
  Authors: We acknowledge that a more rigorous justification is needed. The atomic actions are intended to decompose the combinatorial action space without changing the underlying decision problem, as each sequence of atomic assignments corresponds to a feasible multi-patient assignment. We will revise §3.2 to include a formal argument demonstrating that the long-run average cost is preserved under this decomposition in the time-periodic MDP, or provide counter-example analysis if applicable to clarify the conditions under which optimality is maintained.
  revision: yes
Circularity Check
No significant circularity in derivation chain
The paper introduces a PPO algorithm with atomic actions, partially-shared policy network, and queueing-informed VFA as explicit algorithmic constructions for the overflow MDP. Performance claims rest on direct empirical comparison to ADP and other benchmarks on external hospital instances (up to 20 wards), not on any fitted parameter or self-citation that is redefined as the result. No equation reduces the reported policy quality to its own inputs by construction, and the method is presented as a new scalable approximation whose validity is tested outside the fitting process.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The inpatient overflow problem can be modeled as a time-periodic Markov decision process with long-run average cost.
- ad hoc to paper: Atomic actions preserve optimality of the original multi-patient assignment problem.
invented entities (2)
- atomic actions: no independent evidence
- partially-shared policy network: no independent evidence