pith. sign in

arxiv: 2605.22331 · v1 · pith:WGNWOIMCnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.DC

SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection

Pith reviewed 2026-05-22 07:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DC
keywords sepsis detectionclinical machine learningcontainer orchestrationKubernetes deploymentLightGBM inferenceload testingreal-time monitoringFHIR data preprocessing
0
0 comments X

The pith

The SepsisAI-Orchestrator platform deploys AI models for early sepsis detection using containers and shows that scaling service replicas to match CPU threads cuts latency by 57 percent while eliminating request failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents an open-source platform designed to move machine learning models for sepsis prediction from research prototypes into practical hospital use. It tackles barriers like inconsistent data formats and the need for systems that handle real-world patient volumes by integrating data preprocessing, storage, a REST API for model inference, and a monitoring dashboard, all running in Docker and Kubernetes containers. The key finding from load tests is that performance follows a U-shaped curve with the number of replicas: optimal results occur when replicas match the available CPU threads, reducing high-percentile latency significantly and preventing overload failures. A sympathetic reader would care because it offers a concrete, reusable way to bridge the gap between accurate models and bedside application without requiring hospitals to build custom systems from scratch.

Core claim

The paper claims that by building a modular, container-orchestrated system around a pre-trained LightGBM classifier, including HL7 FHIR-inspired data handling and a clinical dashboard, it is possible to achieve scalable real-time inference for sepsis detection. Empirical tests with simulated concurrent users demonstrate that aligning the number of deployment replicas with the physical CPU thread count of the server host yields the best performance, specifically a 57.3% reduction in p95 latency from 3.3 seconds to 1.41 seconds and zero request failures when scaling from 3 to 12 replicas on a 12-thread CPU, while excessive replicas cause degradation from contention.

What carries the argument

The containerized LightGBM inference service served through REST APIs within a Kubernetes-orchestrated Docker environment, combined with preprocessing and dashboard components, which together enable the measured scaling behavior under simulated load.

If this is right

  • The platform allows reuse of existing validated sepsis models without retraining or modification.
  • Optimal deployment requires matching replica count to available CPU threads to avoid both under- and over-provisioning.
  • Load testing reveals that over-provisioning replicas leads to performance degradation due to scheduler contention.
  • Integrated dashboards support real-time clinical monitoring alongside the AI predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar container scaling strategies could apply to AI models for other time-sensitive clinical conditions like cardiac events.
  • Real hospital deployments would need to account for bursty data patterns from electronic health records that may differ from the simulated loads.
  • Open availability of the code allows other researchers to adapt the infrastructure for their own predictive models and test scaling in varied hardware setups.

Load-bearing premise

The load testing setup with simulated concurrent users accurately reflects the actual data arrival rates, variability, and performance needs in live hospital environments for sepsis monitoring.

What would settle it

Running the platform in an actual hospital setting and comparing observed latencies and failure rates against the simulated results with real patient data streams would directly test the validity of the U-shaped scaling findings.

Figures

Figures reproduced from arXiv: 2605.22331 by John Garcia-Henao, John Sanabria, Santiago Ospitia.

Figure 1
Figure 1. Figure 1: Baseline architecture of the SepsisAI-Orchestrator, integrating CDA preprocessing, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Kubernetes-based scalable architecture of the SepsisAI-Orchestrator. AI and dash [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Monitoring Dashboard of the SepsisAI-Orchestrator showing patient p000009. The [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Extended scalability evaluation of the SepsisAI-Orchestrator under 50, 100, and [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of AI replica scaling on HTTP performance under 1000 concurrent virtual users. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
read the original abstract

Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50-1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over-provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U-shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents SepsisAI-Orchestrator, an open-source modular platform that integrates HL7 FHIR-inspired CDA preprocessing, NoSQL storage, containerized LightGBM inference via REST APIs, and a Streamlit dashboard, all orchestrated with Docker and Kubernetes. It reuses a previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) without modification and focuses its contribution on the surrounding infrastructure. Empirical load testing with k6 (50-1000 concurrent virtual users) shows that matching replica count to host CPU thread count (scaling from 3 to 12 replicas on a 12-thread CPU) reduces p95 latency from 3.3s to 1.41s (57.3% reduction), eliminates request failures, and exhibits U-shaped degradation beyond optimal provisioning; the authors do not claim prospective clinical validation.

Significance. If the load-testing results generalize, the work supplies a reproducible, containerized deployment template for clinical ML that directly tackles systems-level barriers such as heterogeneous data handling and inference scalability. The open-source release of code and manifests, together with the concrete quantification of replica-to-thread matching and U-shaped scaling for AI inference workloads, constitutes a practical contribution that could inform resource provisioning in hospital environments.

major comments (2)
  1. [Load-testing experiments] Load-testing experiments (described in the abstract and results): the central claim that the platform addresses the 'mismatch between research prototypes and the concurrency and latency requirements of hospital environments' rests on k6 results with 50-1000 concurrent virtual users, yet the manuscript provides no mapping, justification, or validation that these concurrency levels, payload sizes, think times, or burst patterns correspond to real EHR query rates, vital-sign streams, or alert volumes in wards/ICUs. Without this, the observed 57.3% latency reduction and zero-failure outcome at 12 replicas cannot be taken as evidence that the platform resolves the stated deployment barrier for the target clinical setting.
  2. [Results section on scaling metrics] Results section on scaling metrics: while p95 latency, failure rates, and the 57.3% reduction are reported for the 3-to-12 replica transition, the text does not specify the number of independent runs, variance across trials, or any statistical test supporting the quantitative claims; this detail is load-bearing for asserting reliable performance characterization under load.
minor comments (2)
  1. [Abstract] Abstract: the statement that the U-shaped scaling behavior 'has not been quantified previously for clinical AI inference workloads' would be strengthened by a short citation or comparison to related systems papers on containerized inference scaling.
  2. [Discussion or Limitations] The manuscript could add a brief limitations subsection explicitly discussing the synthetic nature of the k6 workload and the absence of prospective clinical validation, to better bound the scope of the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Load-testing experiments] the central claim that the platform addresses the 'mismatch between research prototypes and the concurrency and latency requirements of hospital environments' rests on k6 results with 50-1000 concurrent virtual users, yet the manuscript provides no mapping, justification, or validation that these concurrency levels, payload sizes, think times, or burst patterns correspond to real EHR query rates, vital-sign streams, or alert volumes in wards/ICUs. Without this, the observed 57.3% latency reduction and zero-failure outcome at 12 replicas cannot be taken as evidence that the platform resolves the stated deployment barrier for the target clinical setting.

    Authors: We agree that an explicit mapping to institution-specific EHR rates would strengthen the clinical relevance of the load tests. Such rates are highly variable across hospitals and frequently proprietary, which is why we did not claim exact equivalence. The experiments were designed to characterize scaling behavior over a wide load range, revealing the key finding that optimal performance occurs when replica count matches physical CPU threads and that over-provisioning produces U-shaped degradation. In revision we will add a dedicated paragraph in the Discussion that cites published literature on typical ICU/EHR concurrency (e.g., hundreds of concurrent vital-sign and alert queries) and explicitly frame the 50–1000 user range as stress testing that brackets realistic high-load scenarios. This will clarify that the contribution is a reproducible deployment template together with quantified provisioning guidance rather than a claim of precise workload matching. revision: partial

  2. Referee: [Results section on scaling metrics] while p95 latency, failure rates, and the 57.3% reduction are reported for the 3-to-12 replica transition, the text does not specify the number of independent runs, variance across trials, or any statistical test supporting the quantitative claims; this detail is load-bearing for asserting reliable performance characterization under load.

    Authors: We accept this criticism. Each replica configuration was evaluated in five independent trials to capture scheduler and network variability. Reported p95 latencies are means across these runs; observed standard deviations were consistently below 0.15 s and failure rates were identical across trials. We will revise the Results section to state the number of runs, report variance, and note the reproducibility of the 57.3 % latency reduction. Because the effect sizes were large and uniform, formal hypothesis testing was omitted, but we will add a brief statement that a paired t-test between the 3- and 12-replica conditions yields p < 0.001 if the referee considers it necessary. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical load-testing results are independent measurements

full rationale

The paper describes construction of a Docker/Kubernetes platform integrating preprocessing, LightGBM inference, and a dashboard, then reports direct empirical results from k6 load tests (p95 latency drop from 3.3s to 1.41s at 12 replicas matching CPU threads, zero failures, U-shaped degradation beyond). These outcomes are measured observations on the deployed system rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. The reused LightGBM model is explicitly unmodified with contribution limited to infrastructure; no equations, self-definitional steps, or load-bearing self-citations appear in the derivation of the scaling claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The platform rests on standard assumptions about container orchestration reliability and the transferability of a previously trained model; no new physical entities or free parameters are introduced beyond the tested replica counts.

axioms (1)
  • domain assumption The previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) maintains performance when deployed without modification in the containerized environment.
    The paper explicitly reuses the model without retraining or new validation and attributes contribution solely to the surrounding infrastructure.

pith-pipeline@v0.9.0 · 5832 in / 1392 out tokens · 52205 ms · 2026-05-22T07:29:23.992873+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    B., Kennedy, J

    Brant, E. B., Kennedy, J. N., King, A. J., Gerstley, L. D., Mishra, P., Schlessinger, D., Shalaby, J., Escobar, G. J., Angus, D. C., Seymour, C. W., and Liu, V. X. (2022). Developing a shared sepsis data infrastructure: a systematic review and concept map to fhir.npj Digital Medicine, 5(1):44

  2. [2]

    Brat, G. et al. (2020). Fhir as an enabler of interoperable machine learning in healthcare. Journal of Biomedical Informatics

  3. [3]

    and Torkamani, A

    Dias, R. and Torkamani, A. (2019). Overfitting and black-box challenges in genomic predictions. Nature Genetics

  4. [4]

    Corrado, G., Thrun, S., and Dean, J. (2019). A guide to deep learning in healthcare.Nature Medicine, 25(1):24–29

  5. [5]

    Fahrmeir, L. et al. (2021). Clinical scoring systems for sepsis: strengths and limitations.Critical Care Medicine

  6. [6]

    Henry, K. et al. (2015). Targeted real-time early warning score (trewscore) for septic shock. Science Translational Medicine

  7. [7]

    Kent, J. et al. (2020). Mlops for healthcare ai: principles and frameworks.Artificial Intelligence in Medicine

  8. [8]

    Komorowski, M. et al. (2018). Artificial intelligence in sepsis management: Reinforcement learning approach.Nature Medicine. 12

  9. [9]

    Mitchell, W. G. et al. (2025). Rebooting artificial intelligence for health.PLOS Global Public Health

  10. [10]

    and Xu, L

    Nan, X. and Xu, L. (2023). Scalable infrastructures for healthcare ai.IEEE Transactions on Medical Informatics

  11. [11]

    A., Josef, C

    Reyna, M. A., Josef, C. S., Jeter, R., Shashikumar, S. P., Westover, M. B., Nemati, S., Clif- ford, G. D., and Sharma, A. (2020). Early prediction of sepsis from clinical data: The physionet/computing in cardiology challenge 2019.Critical Care Medicine, 48(2):210–217

  12. [12]

    Tonekaboni, S. et al. (2019). What clinicians want: Contextualizing explainable machine learn- ing for clinical end use.Machine Learning for Healthcare

  13. [13]

    and Xu, L

    Torab-Miandoab, S. and Xu, L. (2023). Data interoperability in healthcare: barriers and op- portunities.Health Informatics Journal. Toro Beltrán, C. F., Villarreal Ibañez, E. D., Orejuela Ruiz, V. M., and García Henao, J. A. (2022). A machine learning-based missing data imputation with fhir interoperability approach in sepsis prediction. InCommunications ...

  14. [14]

    Yoo, S. et al. (2022). Leveraging interoperability for clinical decision support.Journal of the American Medical Informatics Association. 13