A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability
Pith reviewed 2026-05-07 16:25 UTC · model grok-4.3
The pith
A benchmark frames transfer of ICU time series models between US regions as domain-incremental learning to test adaptation while retaining prior knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Notable differences exist in ICU measurement distributions and frequencies across US regions, so a domain-incremental learning benchmark can test a model's ability to acquire meaningful information from each new regional domain while retaining key features from the source domain; this is demonstrated by applying data replay and Elastic Weight Consolidation to sequential regional data shifts where the prediction task stays constant but the input distribution changes.
What carries the argument
The domain-incremental continual learning benchmark, which presents ICU time series data from successive US geographic regions as ordered domains with fixed outcome prediction but varying measurement distributions, forcing models to manage distribution shifts without catastrophic forgetting.
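The ordered-domain setup with replay can be sketched as a toy loop (plain Python with invented names like `train_with_replay` and hypothetical region labels; not the paper's code): the prediction task is fixed, regions arrive in sequence, and each training stage mixes the current region's data with a few examples stored from earlier regions.

```python
def train_with_replay(domains, stored_per_domain=2):
    """Toy domain-incremental loop: iterate over regions in order,
    fine-tuning on each region's data plus a small replay buffer."""
    replay_buffer = []   # examples kept from earlier domains
    batches = []         # what the model would be fit on at each stage
    for region_data in domains:
        batches.append(list(region_data) + list(replay_buffer))
        # store a few examples from this region for future replay
        replay_buffer.extend(region_data[:stored_per_domain])
    return batches

# Three hypothetical regional domains: same task, shifted input distribution.
regions = [["ne1", "ne2", "ne3"], ["mw1", "mw2"], ["so1"]]
stages = train_with_replay(regions)
# The final stage trains on the third region's data plus stored
# examples from the first two, so earlier distributions stay visible.
```

The design choice being exercised is exactly the benchmark's tension: a larger buffer protects earlier domains better but raises the storage cost that a receiving hospital must bear.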
If this is right
- Hospitals can fine-tune a single source model on new regional data using replay or regularization instead of retraining from scratch.
- Data replay stores representative examples from the original domain to maintain performance during later regional adaptations.
- Elastic Weight Consolidation identifies and protects parameters critical to the source domain during updates on new data.
- The benchmark quantifies how much regional variation in vital-sign frequencies and values degrades direct transfer without adaptation steps.
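The EWC bullet above can be made concrete with a minimal numerical sketch (NumPy, toy quadratic losses, and invented numbers; not the paper's implementation): a diagonal Fisher estimate weights how strongly each parameter is anchored to its source-domain value while the model descends the new-domain loss.

```python
import numpy as np

def sgd_with_ewc(grad_new, theta, theta_star, fisher, lam, lr=0.1, steps=200):
    """Gradient descent on new-domain loss plus the EWC penalty
    (lam/2) * sum_i F_i * (theta_i - theta*_i)^2."""
    for _ in range(steps):
        g = grad_new(theta) + lam * fisher * (theta - theta_star)
        theta = theta - lr * g
    return theta

# Source-domain optimum theta* = [1, 0]; the Fisher estimate says the
# first parameter mattered a lot for the source task, the second did not.
theta_star = np.array([1.0, 0.0])
fisher = np.array([10.0, 0.1])

# The new domain's quadratic loss pulls both parameters toward [0, 1].
target_new = np.array([0.0, 1.0])
grad_new = lambda th: th - target_new

theta = sgd_with_ewc(grad_new, theta_star.copy(), theta_star, fisher, lam=1.0)
# The heavily protected parameter barely moves from its source value,
# while the unimportant one travels almost all the way to the new optimum.
```

With a quadratic new-domain loss the result is exactly the precision-weighted compromise (target + lam*F*theta*) / (1 + lam*F), which is why EWC is often described as anchoring important parameters rather than freezing the whole model.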
Where Pith is reading between the lines
- If the benchmark proves effective, it could support shared model repositories where large hospitals release base models for smaller ones to adapt sequentially.
- The same regional sequencing approach might apply to other clinical time-series tasks that face geographic or demographic distribution shifts.
- Testing additional continual learning algorithms beyond replay and EWC on this benchmark could reveal which techniques best suit clinical deployment constraints.
- Real-world validation would require checking whether sequential regional exposure matches the actual order in which hospitals adopt transferred models.
Load-bearing premise
Regional differences in ICU measurement distributions and frequencies across the United States meaningfully affect transferred model performance, and framing transportability as domain-incremental learning correctly captures what hospitals need when adapting models to new patient populations.
What would settle it
If a model trained on one region's ICU data showed no performance drop and no forgetting when applied directly to another region's data, without any continual-learning adaptation or stored prior examples, that would indicate the benchmark does not capture a real transportability requirement.
Original abstract
In recent years, machine learning has made significant progress in clinical outcome prediction, demonstrating increasingly accurate results. However, the substantial resources required for hospitals to train these models, such as data collection, labeling, and computational power, limit the feasibility for smaller hospitals to develop their own models. An alternative approach involves transferring a machine learning model trained by a large hospital to smaller hospitals, allowing them to fine-tune the model on their specific patient data. However, these models are often trained and validated on data from a single hospital, raising concerns about their generalizability to new data. Our research shows that there are notable differences in measurement distributions and frequencies across various regions in the United States. To address this, we propose a benchmark that tests a machine learning model's ability to transfer from a source domain to different regions across the country. This benchmark assesses a model's capacity to learn meaningful information about each new domain while retaining key features from the original domain. Using this benchmark, we frame the transfer of a machine learning model from one region to another as a domain incremental learning problem. While the task of patient outcome prediction remains the same, the input data distribution varies, necessitating a model that can effectively manage these shifts. We evaluate two popular domain incremental learning methods: data replay, which stores examples from previous data sources for fine-tuning on the current source, and Elastic Weight Consolidation (EWC), a model parameter regularization method that maintains features important for both data sources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a benchmark for assessing the transportability of ICU time series models across US regions by framing the problem as domain-incremental continual learning. It notes regional differences in measurement distributions and frequencies, requires models to learn from new domains while retaining source-domain features, and evaluates data replay and Elastic Weight Consolidation (EWC) as solutions.
Significance. If the benchmark construction and evaluations hold, the work could supply a practical testbed for clinical model generalization, helping address resource constraints at smaller hospitals by enabling transfer rather than from-scratch training.
major comments (2)
- Abstract: the premise that effective transportability requires retaining key features from the original domain (and thus evaluating replay/EWC to prevent forgetting) is not justified by the stated use-case. The manuscript describes transferring a model to a new hospital for local fine-tuning; post-transfer performance on the source domain is irrelevant to the receiving site, so the benchmark imposes an extraneous no-forgetting constraint that does not match practical transportability requirements.
- Abstract: the central claims rest on quantitative results that are not shown; no specific metrics, dataset sizes, evaluation protocols, or performance numbers for the replay and EWC methods are reported. Without these, the soundness of the benchmark and the comparative assessment of the two methods cannot be evaluated.
minor comments (2)
- Specify the exact clinical prediction task (e.g., in-hospital mortality, 30-day readmission) and the precise time-series variables and sampling frequencies used in the benchmark.
- Clarify how the regional domains are partitioned (e.g., by state, hospital system, or geographic cluster) and report quantitative measures of the distribution shifts observed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.
Point-by-point responses
-
Referee: Abstract: the premise that effective transportability requires retaining key features from the original domain (and thus evaluating replay/EWC to prevent forgetting) is not justified by the stated use-case. The manuscript describes transferring a model to a new hospital for local fine-tuning; post-transfer performance on the source domain is irrelevant to the receiving site, so the benchmark imposes an extraneous no-forgetting constraint that does not match practical transportability requirements.
Authors: We acknowledge the referee's point that the primary described use-case is one-time transfer followed by local fine-tuning at the receiving hospital, where source-domain performance after transfer may not be directly relevant. Our benchmark is explicitly framed as domain-incremental continual learning to evaluate sequential adaptation across regions while tracking retention of prior-domain features, which aligns with scenarios such as iterative model updates in a multi-hospital network or when retained source features improve robustness to future shifts. However, we agree that the abstract does not sufficiently justify why the no-forgetting constraint matches the core transportability use-case. We will revise the abstract and introduction to clarify the motivation, explicitly distinguishing between single-transfer scenarios (where forgetting may be acceptable) and multi-domain or sequential-adaptation settings where retention provides value, and adjust the benchmark description accordingly. revision: partial
-
Referee: Abstract: the central claims rest on quantitative results that are not shown; no specific metrics, dataset sizes, evaluation protocols, or performance numbers for the replay and EWC methods are reported. Without these, the soundness of the benchmark and the comparative assessment of the two methods cannot be evaluated.
Authors: The full manuscript contains these elements in the Methods, Experiments, and Results sections: dataset sizes drawn from the eICU-CRD across US regions (with patient counts and feature statistics per domain), evaluation protocols involving sequential addition of domains with metrics tracked on all prior domains to measure both adaptation and forgetting, specific performance metrics (AUROC and AUPRC for mortality prediction), and quantitative comparisons of data replay versus EWC (including forgetting rates and final performance). We will update the abstract to include key quantitative highlights and summary statistics so that the central claims are supported without requiring the reader to consult the full text. revision: yes
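The forgetting rates mentioned in this response are typically derived from a matrix of per-domain scores recorded after each training stage. A common definition in the continual-learning literature (sketched here with invented AUROC numbers, not the paper's reported results) is the best score a domain ever reached minus its score after the final stage.

```python
import numpy as np

# acc[i][j] = AUROC on domain j after finishing training on domain i
# (hypothetical numbers for a three-region sequence).
acc = np.array([
    [0.85, 0.70, 0.65],
    [0.80, 0.84, 0.68],
    [0.76, 0.79, 0.83],
])

def forgetting(acc):
    """For each non-final domain j: best score it ever achieved
    minus its score after training on the last domain."""
    T = acc.shape[0]
    return [float(acc[:T - 1, j].max() - acc[T - 1, j]) for j in range(T - 1)]

f = forgetting(acc)
# Domain 0 peaked at 0.85 and ends at 0.76, so its forgetting is 0.09;
# domain 1 peaked at 0.84 and ends at 0.79, giving 0.05.
```

Averaging these per-domain gaps gives a single forgetting rate, while the diagonal of `acc` summarizes adaptation to each new region.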
Circularity Check
No circularity; benchmark and framing are externally defined
Full rationale
The paper proposes an externally defined benchmark based on observed regional distribution shifts in ICU data, frames transportability as domain-incremental learning, and evaluates standard methods (replay, EWC) without any derived predictions, fitted parameters renamed as results, or load-bearing self-citations. All steps rest on independent empirical observations and established continual-learning techniques rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption There are notable differences in measurement distributions and frequencies across various regions in the United States that affect model generalizability.
Reference graph
Works this paper leans on
- [1] G. Quer, R. Arnaout, M. Henne, and R. Arnaout, "Machine learning and the future of cardiovascular care: JACC state-of-the-art review," Journal of the American College of Cardiology, vol. 77, no. 3, pp. 300–313, 2021.
- [2] A. N. Richter and T. M. Khoshgoftaar, "A review of statistical and machine learning methods for modeling cancer risk using structured clinical data," Artificial Intelligence in Medicine, vol. 90, pp. 1–14, 2018.
- [3] J. S. Ross, J. D. Ritchie, E. Finn, N. R. Desai, R. L. Lehman, H. M. Krumholz, and C. P. Gross, "Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users," BMJ Open, vol. 6, no. 9, p. e012769, 2016.
- [4] A. Hughes, M. M. H. Shandhi, H. Master, J. Dunn, and E. Brittain, "Wearable devices in cardiovascular medicine," Circulation Research, vol. 132, no. 5, pp. 652–670, 2023.
- [5] B. S. Wessler, J. Nelson, J. G. Park, H. McGinnes, G. Gulati, R. Brazil, B. Van Calster, D. van Klaveren, E. Venema, E. Steyerberg et al., "External validations of cardiovascular clinical prediction models: a large-scale review of the literature," Circulation: Cardiovascular Quality and Outcomes, vol. 14, no. 8, p. e007858, 2021.
- [6] M. McDermott, B. Nestor, E. Kim, W. Zhang, A. Goldenberg, P. Szolovits, and M. Ghassemi, "A comprehensive EHR timeseries pre-training benchmark," in Proceedings of the Conference on Health, Inference, and Learning, 2021, pp. 257–278.
- [7] G. M. Van de Ven, T. Tuytelaars, and A. S. Tolias, "Three types of incremental learning," Nature Machine Intelligence, vol. 4, no. 12, pp. 1185–1197, 2022.
- [8] R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
- [9] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi, "Rainbow memory: Continual learning with a memory of diverse samples," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8218–8227.
- [10] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
- [11] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan, "Multitask learning and benchmarking with clinical time series data," Scientific Data, vol. 6, no. 1, Jun. 2019. Available: http://dx.doi.org/10.1038/s41597-019-0103-9
- [12] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in International Conference on Machine Learning. PMLR, 2017, pp. 3987–3995.
- [13] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.
- [14] J. Yoon, E. Yang, J. Lee, and S. J. Hwang, "Lifelong learning with dynamically expandable networks," arXiv preprint arXiv:1708.01547, 2017.
- [15] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, no. 1, pp. 1–9, 2016.
- [16] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi, "The eICU Collaborative Research Database, a freely available multi-center database for critical care research," Scientific Data, vol. 5, no. 1, pp. 1–13, 2018.
- [17] S. Sheikhalishahi, V. Balaraman, and V. Osmani, "Benchmarking machine learning models on multi-centre eICU critical care dataset," PLOS ONE, vol. 15, no. 7, p. e0235424, Jul. 2020. Available: http://dx.doi.org/10.1371/journal.pone.0235424
- [18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
- [19] D.-W. Zhou, Q.-W. Wang, Z.-H. Qi, H.-J. Ye, D.-C. Zhan, and Z. Liu, "Deep class-incremental learning: A survey," 2023.
- [20] E. Verwimp, M. D. Lange, and T. Tuytelaars, "Rehearsal revealed: The limits and merits of revisiting samples in continual learning," 2021.