A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability
Pith reviewed 2026-05-07 16:25 UTC · model grok-4.3
The pith
A benchmark frames transfer of ICU time series models between US regions as domain-incremental learning to test adaptation while retaining prior knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Notable differences exist in ICU measurement distributions and frequencies across US regions, so a domain-incremental learning benchmark can test a model's ability to acquire meaningful information from each new regional domain while retaining key features from the source domain; this is demonstrated by applying data replay and Elastic Weight Consolidation to sequential regional data shifts where the prediction task stays constant but the input distribution changes.
What carries the argument
The domain-incremental continual learning benchmark, which presents ICU time series data from successive US geographic regions as ordered domains with fixed outcome prediction but varying measurement distributions, forcing models to manage distribution shifts without catastrophic forgetting.
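The ordered-domain setup with replay can be sketched as a toy loop (plain Python with invented names like `train_with_replay` and hypothetical region labels; not the paper's code): the prediction task is fixed, regions arrive in sequence, and each training stage mixes the current region's data with a few examples stored from earlier regions.

```python
def train_with_replay(domains, stored_per_domain=2):
    """Toy domain-incremental loop: iterate over regions in order,
    fine-tuning on each region's data plus a small replay buffer."""
    replay_buffer = []   # examples kept from earlier domains
    batches = []         # what the model would be fit on at each stage
    for region_data in domains:
        batches.append(list(region_data) + list(replay_buffer))
        # store a few examples from this region for future replay
        replay_buffer.extend(region_data[:stored_per_domain])
    return batches

# Three hypothetical regional domains: same task, shifted input distribution.
regions = [["ne1", "ne2", "ne3"], ["mw1", "mw2"], ["so1"]]
stages = train_with_replay(regions)
# The final stage trains on the third region's data plus stored
# examples from the first two, so earlier distributions stay visible.
```

The design choice being exercised is exactly the benchmark's tension: a larger buffer protects earlier domains better but raises the storage cost that a receiving hospital must bear.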
If this is right
- Hospitals can fine-tune a single source model on new regional data using replay or regularization instead of retraining from scratch.
- Data replay stores representative examples from the original domain to maintain performance during later regional adaptations.
- Elastic Weight Consolidation identifies and protects parameters critical to the source domain during updates on new data.
- The benchmark quantifies how much regional variation in vital-sign frequencies and values degrades direct transfer without adaptation steps.
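The EWC bullet above can be made concrete with a minimal numerical sketch (NumPy, toy quadratic losses, and invented numbers; not the paper's implementation): a diagonal Fisher estimate weights how strongly each parameter is anchored to its source-domain value while the model descends the new-domain loss.

```python
import numpy as np

def sgd_with_ewc(grad_new, theta, theta_star, fisher, lam, lr=0.1, steps=200):
    """Gradient descent on new-domain loss plus the EWC penalty
    (lam/2) * sum_i F_i * (theta_i - theta*_i)^2."""
    for _ in range(steps):
        g = grad_new(theta) + lam * fisher * (theta - theta_star)
        theta = theta - lr * g
    return theta

# Source-domain optimum theta* = [1, 0]; the Fisher estimate says the
# first parameter mattered a lot for the source task, the second did not.
theta_star = np.array([1.0, 0.0])
fisher = np.array([10.0, 0.1])

# The new domain's quadratic loss pulls both parameters toward [0, 1].
target_new = np.array([0.0, 1.0])
grad_new = lambda th: th - target_new

theta = sgd_with_ewc(grad_new, theta_star.copy(), theta_star, fisher, lam=1.0)
# The heavily protected parameter barely moves from its source value,
# while the unimportant one travels almost all the way to the new optimum.
```

With a quadratic new-domain loss the result is exactly the precision-weighted compromise (target + lam*F*theta*) / (1 + lam*F), which is why EWC is often described as anchoring important parameters rather than freezing the whole model.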
Where Pith is reading between the lines
- If the benchmark proves effective, it could support shared model repositories where large hospitals release base models for smaller ones to adapt sequentially.
- The same regional sequencing approach might apply to other clinical time-series tasks that face geographic or demographic distribution shifts.
- Testing additional continual learning algorithms beyond replay and EWC on this benchmark could reveal which techniques best suit clinical deployment constraints.
- Real-world validation would require checking whether sequential regional exposure matches the actual order in which hospitals adopt transferred models.
Load-bearing premise
Regional differences in ICU measurement distributions and frequencies across the United States meaningfully affect transferred model performance, and framing transportability as domain-incremental learning correctly captures what hospitals need when adapting models to new patient populations.
What would settle it
If a model trained on one region's ICU data showed no performance drop and no forgetting when applied directly to another region's data, without any continual-learning adaptation or stored prior examples, that would indicate the benchmark does not capture a real transportability requirement.
Original abstract
In recent years, machine learning has made significant progress in clinical outcome prediction, demonstrating increasingly accurate results. However, the substantial resources required for hospitals to train these models, such as data collection, labeling, and computational power, limit the feasibility for smaller hospitals to develop their own models. An alternative approach involves transferring a machine learning model trained by a large hospital to smaller hospitals, allowing them to fine-tune the model on their specific patient data. However, these models are often trained and validated on data from a single hospital, raising concerns about their generalizability to new data. Our research shows that there are notable differences in measurement distributions and frequencies across various regions in the United States. To address this, we propose a benchmark that tests a machine learning model's ability to transfer from a source domain to different regions across the country. This benchmark assesses a model's capacity to learn meaningful information about each new domain while retaining key features from the original domain. Using this benchmark, we frame the transfer of a machine learning model from one region to another as a domain incremental learning problem. While the task of patient outcome prediction remains the same, the input data distribution varies, necessitating a model that can effectively manage these shifts. We evaluate two popular domain incremental learning methods: data replay, which stores examples from previous data sources for fine-tuning on the current source, and Elastic Weight Consolidation (EWC), a model parameter regularization method that maintains features important for both data sources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a benchmark for assessing the transportability of ICU time series models across US regions by framing the problem as domain-incremental continual learning. It notes regional differences in measurement distributions and frequencies, requires models to learn from new domains while retaining source-domain features, and evaluates data replay and Elastic Weight Consolidation (EWC) as solutions.
Significance. If the benchmark construction and evaluations hold, the work could supply a practical testbed for clinical model generalization, helping address resource constraints at smaller hospitals by enabling transfer rather than from-scratch training.
major comments (2)
- Abstract: the premise that effective transportability requires retaining key features from the original domain (and thus evaluating replay/EWC to prevent forgetting) is not justified by the stated use-case. The manuscript describes transferring a model to a new hospital for local fine-tuning; post-transfer performance on the source domain is irrelevant to the receiving site, so the benchmark imposes an extraneous no-forgetting constraint that does not match practical transportability requirements.
- Abstract: the central claims rest on quantitative results that are not shown; no specific metrics, dataset sizes, evaluation protocols, or performance numbers for the replay and EWC methods are reported. Without these, the soundness of the benchmark and the comparative assessment of the two methods cannot be evaluated.
minor comments (2)
- Specify the exact clinical prediction task (e.g., in-hospital mortality, 30-day readmission) and the precise time-series variables and sampling frequencies used in the benchmark.
- Clarify how the regional domains are partitioned (e.g., by state, hospital system, or geographic cluster) and report quantitative measures of the distribution shifts observed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.
Point-by-point responses
-
Referee: Abstract: the premise that effective transportability requires retaining key features from the original domain (and thus evaluating replay/EWC to prevent forgetting) is not justified by the stated use-case. The manuscript describes transferring a model to a new hospital for local fine-tuning; post-transfer performance on the source domain is irrelevant to the receiving site, so the benchmark imposes an extraneous no-forgetting constraint that does not match practical transportability requirements.
Authors: We acknowledge the referee's point that the primary described use-case is one-time transfer followed by local fine-tuning at the receiving hospital, where source-domain performance after transfer may not be directly relevant. Our benchmark is explicitly framed as domain-incremental continual learning to evaluate sequential adaptation across regions while tracking retention of prior-domain features, which aligns with scenarios such as iterative model updates in a multi-hospital network or when retained source features improve robustness to future shifts. However, we agree that the abstract does not sufficiently justify why the no-forgetting constraint matches the core transportability use-case. We will revise the abstract and introduction to clarify the motivation, explicitly distinguishing between single-transfer scenarios (where forgetting may be acceptable) and multi-domain or sequential-adaptation settings where retention provides value, and adjust the benchmark description accordingly. revision: partial
-
Referee: Abstract: the central claims rest on quantitative results that are not shown; no specific metrics, dataset sizes, evaluation protocols, or performance numbers for the replay and EWC methods are reported. Without these, the soundness of the benchmark and the comparative assessment of the two methods cannot be evaluated.
Authors: The full manuscript contains these elements in the Methods, Experiments, and Results sections: dataset sizes drawn from the eICU-CRD across US regions (with patient counts and feature statistics per domain), evaluation protocols involving sequential addition of domains with metrics tracked on all prior domains to measure both adaptation and forgetting, specific performance metrics (AUROC and AUPRC for mortality prediction), and quantitative comparisons of data replay versus EWC (including forgetting rates and final performance). We will update the abstract to include key quantitative highlights and summary statistics so that the central claims are supported without requiring the reader to consult the full text. revision: yes
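The forgetting rates mentioned in this response are typically derived from a matrix of per-domain scores recorded after each training stage. A common definition in the continual-learning literature (sketched here with invented AUROC numbers, not the paper's reported results) is the best score a domain ever reached minus its score after the final stage.

```python
import numpy as np

# acc[i][j] = AUROC on domain j after finishing training on domain i
# (hypothetical numbers for a three-region sequence).
acc = np.array([
    [0.85, 0.70, 0.65],
    [0.80, 0.84, 0.68],
    [0.76, 0.79, 0.83],
])

def forgetting(acc):
    """For each non-final domain j: best score it ever achieved
    minus its score after training on the last domain."""
    T = acc.shape[0]
    return [float(acc[:T - 1, j].max() - acc[T - 1, j]) for j in range(T - 1)]

f = forgetting(acc)
# Domain 0 peaked at 0.85 and ends at 0.76, so its forgetting is 0.09;
# domain 1 peaked at 0.84 and ends at 0.79, giving 0.05.
```

Averaging these per-domain gaps gives a single forgetting rate, while the diagonal of `acc` summarizes adaptation to each new region.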
Circularity Check
No circularity; benchmark and framing are externally defined
Full rationale
The paper proposes an externally defined benchmark based on observed regional distribution shifts in ICU data, frames transportability as domain-incremental learning, and evaluates standard methods (replay, EWC) without any derived predictions, fitted parameters renamed as results, or load-bearing self-citations. All steps rest on independent empirical observations and established continual-learning techniques rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption There are notable differences in measurement distributions and frequencies across various regions in the United States that affect model generalizability.
Reference graph
Works this paper leans on
- [1] G. Quer, R. Arnaout, M. Henne, and R. Arnaout, "Machine learning and the future of cardiovascular care: JACC state-of-the-art review," Journal of the American College of Cardiology, vol. 77, no. 3, pp. 300–313, 2021.
- [2] A. N. Richter and T. M. Khoshgoftaar, "A review of statistical and machine learning methods for modeling cancer risk using structured clinical data," Artificial Intelligence in Medicine, vol. 90, pp. 1–14, 2018.
- [3] J. S. Ross, J. D. Ritchie, E. Finn, N. R. Desai, R. L. Lehman, H. M. Krumholz, and C. P. Gross, "Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users," BMJ Open, vol. 6, no. 9, p. e012769, 2016.
- [4] A. Hughes, M. M. H. Shandhi, H. Master, J. Dunn, and E. Brittain, "Wearable devices in cardiovascular medicine," Circulation Research, vol. 132, no. 5, pp. 652–670, 2023.
- [5] B. S. Wessler, J. Nelson, J. G. Park, H. McGinnes, G. Gulati, R. Brazil, B. Van Calster, D. van Klaveren, E. Venema, E. Steyerberg et al., "External validations of cardiovascular clinical prediction models: a large-scale review of the literature," Circulation: Cardiovascular Quality and Outcomes, vol. 14, no. 8, p. e007858, 2021.
- [6] M. McDermott, B. Nestor, E. Kim, W. Zhang, A. Goldenberg, P. Szolovits, and M. Ghassemi, "A comprehensive EHR timeseries pre-training benchmark," in Proceedings of the Conference on Health, Inference, and Learning, 2021, pp. 257–278.
- [7] G. M. Van de Ven, T. Tuytelaars, and A. S. Tolias, "Three types of incremental learning," Nature Machine Intelligence, vol. 4, no. 12, pp. 1185–1197, 2022.
- [8] R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
- [9] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi, "Rainbow memory: Continual learning with a memory of diverse samples," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8218–8227.
- [10] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
- [11] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan, "Multitask learning and benchmarking with clinical time series data," Scientific Data, vol. 6, no. 1, Jun. 2019. Available: http://dx.doi.org/10.1038/s41597-019-0103-9
- [12] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in International Conference on Machine Learning. PMLR, 2017, pp. 3987–3995.
- [13] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.
- [14] J. Yoon, E. Yang, J. Lee, and S. J. Hwang, "Lifelong learning with dynamically expandable networks," arXiv preprint arXiv:1708.01547, 2017.
- [15] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, no. 1, pp. 1–9, 2016.
- [16] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi, "The eICU Collaborative Research Database, a freely available multi-center database for critical care research," Scientific Data, vol. 5, no. 1, pp. 1–13, 2018.
- [17] S. Sheikhalishahi, V. Balaraman, and V. Osmani, "Benchmarking machine learning models on multi-centre eICU critical care dataset," PLOS ONE, vol. 15, no. 7, p. e0235424, Jul. 2020. Available: http://dx.doi.org/10.1371/journal.pone.0235424
- [18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
- [19] D.-W. Zhou, Q.-W. Wang, Z.-H. Qi, H.-J. Ye, D.-C. Zhan, and Z. Liu, "Deep class-incremental learning: A survey," 2023.
- [20] E. Verwimp, M. D. Lange, and T. Tuytelaars, "Rehearsal revealed: The limits and merits of revisiting samples in continual learning," 2021.