arxiv: 2512.21602 · v3 · pith:LBWCSO2Cnew · submitted 2025-12-25 · 💻 cs.LG · cs.CV

An Empirical Study of Machine Learning Robustness and Scalability for Imbalanced Tabular Clinical Data in Emergency and Critical Care

Pith reviewed 2026-05-16 19:42 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords tabular foundation models · class imbalance · clinical prediction · emergency care · machine learning · robustness · scalability · TabPFN

The pith

Tabular foundation models achieve competitive results on imbalanced clinical data at lower computational cost than deeper alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seven model families on highly imbalanced tabular data from two large clinical sources, measuring weighted F1 performance, degradation under rising imbalance, and compute scaling across seven prediction tasks. Tree-based methods such as XGBoost lead on one dataset while foundation models TabPFN v2.6 and TabICL lead on the other, with the latter group delivering strong ranks through an inference-only paradigm that keeps per-task cost low. TabResNet improves on TabNet yet still trails ensembles, and classical methods scale best as data volume grows. No family wins on every task and dataset, yet the efficiency edge of foundation models stands out as the practical takeaway for time-sensitive clinical environments.

Core claim

Across seven imbalanced tasks drawn from MIMIC-IV-ED and eICU, TabPFN v2.6 and TabICL posted the top average weighted F1 ranks on MIMIC-IV-ED, while XGBoost led on eICU. TabNet declined fastest with increasing imbalance and carried the highest compute load; TabResNet outperformed TabNet at lower cost. Tree methods scaled most favorably with dataset size, and foundation models kept per-task cost low by running inference only. The paper concludes that foundation models combine competitive accuracy with efficiency that could support broader deployment.
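To make "average rank" concrete, here is a minimal sketch of how ranks might be averaged across experimental blocks. The scores, the model subset, and the function names `rank_desc` and `average_ranks` are hypothetical illustrations, not values or code from the paper.

```python
from statistics import mean

def rank_desc(scores):
    """Rank models within one block: 1 = highest score; ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over every model tied with the model at position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks

def average_ranks(blocks):
    """Average each model's within-block rank over all blocks."""
    per_block = [rank_desc(block) for block in blocks]
    return {model: mean(r[model] for r in per_block) for model in per_block[0]}

# Hypothetical weighted F1 scores for three blocks (NOT results from the paper).
blocks = [
    {"XGBoost": 0.80, "TabPFN": 0.82, "TabNet": 0.60},
    {"XGBoost": 0.75, "TabPFN": 0.75, "TabNet": 0.70},
    {"XGBoost": 0.90, "TabPFN": 0.85, "TabNet": 0.65},
]
avg = average_ranks(blocks)
```

A lower average rank means better performance across blocks. Ties receive the mean of the positions they span, which is the usual convention for rank-based comparisons.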

What carries the argument

Side-by-side empirical comparison of seven model families (Decision Tree, Random Forest, XGBoost, TabNet, TabResNet, TabICL, TabPFN v2.6) using weighted F1-score, robustness curves under controlled imbalance increase, and wall-clock scaling measurements.
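Because the comparison hinges on an F1 variant, it helps to see how weighted and macro averaging differ under imbalance. The sketch below uses only the standard library; the function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 for each label seen in either the true or the predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    f1 = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by each class's share of the true labels."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1[c] * support[c] / n for c in f1)

# Toy example: a 9:1 imbalance and a classifier that always predicts the majority.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case the weighted score stays high because the majority class dominates the average, while the macro score is pulled down by the missed minority class. Which variant a benchmark reports can therefore change how a model appears to handle rare outcomes.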

If this is right

  • Foundation models can deliver strong weighted F1 scores on one major clinical corpus while keeping per-task compute low through inference.
  • Tree-based methods remain the most scalable option as dataset size increases.
  • TabResNet provides a lighter-weight alternative that consistently beats TabNet but does not overtake ensembles.
  • Performance rankings shift between datasets, so no single family can be assumed best for every clinical task.
  • If the low per-task cost pattern holds, foundation models lower the barrier to running adaptive models in resource-limited hospitals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same efficiency pattern could appear in other high-stakes tabular domains that face class imbalance, such as fraud or safety monitoring.
  • Hybrid systems that route easy cases to trees and hard cases to foundation models might combine the best scaling and accuracy traits.
  • Testing these models on streaming clinical data arriving in real time would reveal whether the reported inference savings survive production constraints.
  • Adding more hospitals or countries to the evaluation would test whether the observed dataset dependence is an artifact of the two sources used.

Load-bearing premise

The two chosen clinical datasets and seven tasks capture the imbalance patterns and deployment constraints typical of real emergency and critical care settings.

What would settle it

A replication on an independent third clinical dataset where foundation models lose their accuracy edge or require more compute than ensembles would falsify the reported efficiency promise.

Figures

Figures reproduced from arXiv: 2512.21602 by Marcellin Atemkeng, Yusuf Brima.

Figure 1: Architecture of TabResNet. Sequential structure of the network architecture for tabular data. The Input Layer (Linear + BatchNorm + ReLU + Dropout) is followed by 1–3 Compact Residual Blocks, each containing two Linear layers with BatchNorm, ReLU, Dropout, and a skip connection. An optional Reduction Layer precedes the Output Layer, which produces class predictions.
Figure 2: Class distribution of target outcomes. Class distribution of target variables in the MIMIC-IV-ED (top row) and eICU (bottom row) datasets. The histograms illustrate the frequency of samples across various clinical prediction tasks
Figure 3: Correlation of class imbalance metrics (MIMIC-IV-ED). Correlation heatmaps of class imbalance metrics for different prediction tasks on the MIMIC-IV-ED dataset. Pairwise correlations are shown between the CVCF, IR, and NECD metrics across weighting strategies (inverse, effective, median) for three prediction tasks.
Figure 4: Effect of class imbalance on discharge diagnosis. Weighted F1 performance across varying levels of class imbalance for primary diagnosis prediction. The performance curves for 20 classifiers are shown, with the weighted F1 value decreasing as the imbalance severity increases
Figure 5: Effect of class imbalance on ICD Code prediction. Weighted F1 performance across varying levels of class imbalance for ICD code group prediction. Compared with fine-grained diagnosis prediction, grouped ICD categories reduce label sparsity, and classifiers generally maintain greater stability.
Figure 6: Effect of class imbalance on disposition prediction. Weighted F1 performance across varying levels of class imbalance for patient disposition prediction. The prediction of discharge outcomes shows moderate robustness to imbalance, with tree-based ensembles consistently outperforming deep learning models. Performance decreased most sharply for fine-grained diagnosis prediction, where the large number of rar…
Figure 7: Critical difference analysis of classifier performance. This shows the average ranks of 20 classifiers on the basis of their weighted F1 performance across experimental blocks. Lower ranks indicate better predictive performance. The classifiers connected by a horizontal bar are not significantly different from each other according to Wilcoxon signed-rank tests with Holm correction. To assess whether these …
Figure 8: Critical difference analysis of classifier training times. This shows the average ranks of 20 classifiers based on their training times across 15 experimental blocks (each block is a unique combination of target variable and training set size). Lower ranks indicate faster training. Classifiers connected by a horizontal bar are not significantly different from each other according to Wilcoxon signed-rank te…
Figure 9: Conceptual system architecture for AI-enabled clinical decision support. An overview of an ML-enabled clinical decision support system for ICU and ED care. Archival and/or real-time data streams from monitors, ventilators, infusion pumps, and EHRs feed into predictive models for tasks such as mortality prediction, disposition, and triage prioritization. Model outputs are delivered through clinician-facing …
Figure 10: Computational scaling across model architectures. Training time as a function of dataset size for different prediction tasks. Each panel shows the training time (seconds, log scale) versus the total number of training samples for a specific target variable. The results are reported across all classifiers and class weighting strategies.
Figure 11 shows the pairwise correlations between the CVCF, IR, and NECD for different weighting strategies. NECD was inversely correlated with IR and CVCF, providing a bounded and interpretable measure of overall class distributional uncertainty. The patterns were consistent across tasks, confirming that these measures capture related but complementary aspects of imbalance severity
Figure 12: Classifier performance rankings across experimental conditions. Critical difference diagram showing the average ranks of 20 classifiers on the eICU dataset on the basis of their weighted F1 scores across experimental blocks. Lower ranks indicate better predictive performance. The classifiers connected by a horizontal bar are not significantly different according to Wilcoxon signed-rank tests with Holm cor…
Figure 13: Impact of class imbalance on discharge disposition prediction. Weighted F1 outcomes for five model families (Decision Tree, Random Forest, TabNet, TabResNet, XGBoost) under four weighting strategies (none, inverse, effective number, median). Performance trends are shown with respect to three imbalance measures. Tree-based ensemble approaches, particularly XGBoost, displayed the most robust performance acr…
Figure 14: Class imbalance and length-of-stay prediction. Weighted F1 trajectories for the same model families across different weighting strategies, plotted against CVCF, IR, and NECD. Although performance gradually decreased with stronger imbalance, overall F1 scores remained relatively high. XGBoost and Random Forest retained stable accuracy, while TabNet and TabResNet were more affected at higher skew levels. Th…
Figure 15: Class imbalance effect on mortality risk prediction. Weighted F1 results across model families and reweighting schemes, examined under CVCF, IR, and NECD. Mortality prediction tasks showed marked vulnerability to imbalance, with sharper degradation for neural models (especially TabNet). Tree-based ensembles, and XGBoost in particular, demonstrated comparatively greater resilience. All three metrics captur…
Figure 16: Influence of imbalance on resource utilization prediction. Weighted F1 values for five model families (Decision Tree, Random Forest, TabNet, TabResNet, and XGBoost) using four weighting schemes. Results are tracked across three imbalance metrics. While most models experienced only gradual declines in performance, deep learning methods were more susceptible to skew, whereas ensemble methods maintained stro…
Figure 17: Performance under class imbalance for severity prediction. Weighted F1 results comparing all model families and weighting strategies against CVCF, IR, and NECD. Severity classification was less sensitive to imbalance than other tasks, with tree-based ensembles, especially XGBoost, exhibiting stable performance, while neural architectures such as TabNet and TabResNet showed modest but consistent degradatio…
Figure 18: Critical difference diagram for classifier performance. Critical difference diagram of the average ranks of 20 classifiers on this dataset, based on weighted F1-scores across experimental blocks. Lower ranks indicate better predictive performance. Horizontal bars connect classifiers that are not significantly different under Wilcoxon signed-rank tests with Holm correction.
Figure 19 presents extended analyses of training time scaling with dataset size across the five prediction tasks in the eICU dataset. Training time is reported on a logarithmic scale and shown for all classifiers and weighting strategies. These plots complement the rank-based comparisons by illustrating absolute training durations and highlighting the scaling gap between tree-based ensembles and deep learning model…
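Several captions above refer to CVCF, IR, and NECD without expanding them. The sketch below assumes the common readings: IR as the imbalance ratio (largest class count over smallest), CVCF as the coefficient of variation of class frequencies, and NECD as the normalized entropy of the class distribution. These expansions and the function name `imbalance_metrics` are assumptions; the paper's exact definitions may differ.

```python
import math
from collections import Counter

def imbalance_metrics(labels):
    """Compute three imbalance summaries under the ASSUMED definitions above."""
    counts = list(Counter(labels).values())
    k = len(counts)            # number of classes observed
    n = sum(counts)            # number of samples
    # IR: largest class count divided by smallest class count (1.0 = balanced).
    ir = max(counts) / min(counts)
    # CVCF: population standard deviation of class counts over their mean.
    mean_c = n / k
    std_c = math.sqrt(sum((c - mean_c) ** 2 for c in counts) / k)
    cvcf = std_c / mean_c
    # NECD: Shannon entropy of class proportions divided by log(k), so it lies in [0, 1].
    entropy = -sum((c / n) * math.log(c / n) for c in counts)
    necd = entropy / math.log(k) if k > 1 else 0.0
    return {"IR": ir, "CVCF": cvcf, "NECD": necd}

balanced = imbalance_metrics([0] * 50 + [1] * 50)
skewed = imbalance_metrics([0] * 95 + [1] * 5)
```

Under these assumed definitions, IR and CVCF grow as the distribution becomes more skewed while NECD shrinks toward 0, which matches the inverse correlation described in the Figure 11 text.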
Original abstract

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically benchmarks seven model families (Decision Tree, Random Forest, XGBoost, TabNet, TabResNet, TabICL, TabPFN v2.6) on seven imbalanced prediction tasks drawn from MIMIC-IV-ED and eICU. It reports that foundation models achieve competitive weighted F1 scores with low per-task computational cost on MIMIC-IV-ED while tree-based methods dominate on eICU, that TabNet degrades most sharply with increasing imbalance, and that no single family dominates across both datasets; the central claim is that tabular foundation models therefore show practical promise for resource-constrained clinical settings.

Significance. If the efficiency advantage of foundation models generalizes beyond the two evaluated datasets, the work could lower barriers to deploying adaptive decision support in emergency and critical care. The explicit comparison of a lightweight TabResNet against TabNet and the focus on both robustness and scalability are useful contributions to the tabular ML literature.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (results): the headline claim that foundation models 'combine competitive performance at low computational cost' rests on performance ranks from only MIMIC-IV-ED and eICU. These datasets share US-centric demographics, similar feature schemas, and overlapping imbalance ratios; the manuscript provides no experiments or discussion testing whether the observed ranking survives shifts in missingness patterns, outcome prevalence, or non-US critical-care distributions, which directly undermines the generalization to 'broader clinical settings' asserted in the final paragraph.
  2. [Abstract and §3] Abstract and §3 (experimental setup): the abstract states that models were evaluated for 'robustness to increasing imbalance' but supplies no description of the exact imbalance-generation procedure (e.g., subsampling ratios, stratification, or whether synthetic minority oversampling was used), nor any statistical testing or error bars on the weighted F1 scores. Without these, the reported performance differences and the claim that TabNet shows the 'steepest performance decline' cannot be verified as load-bearing.
minor comments (2)
  1. [§2] §2 (related work): the positioning of TabResNet as a 'lightweight alternative to TabNet' would benefit from a short table comparing parameter counts and inference latency of the two architectures on the same hardware.
  2. [§4] Figure captions and §4: several performance plots lack axis labels for imbalance ratio or dataset size; adding these would improve readability without altering the results.
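Major comment 2 asks how the imbalance levels were generated. One possible procedure, shown purely as a hypothetical illustration and not as the authors' method, is to keep every majority-class row and randomly drop minority-class rows until a target imbalance ratio is reached. The function name `subsample_to_ratio` and its parameters are assumptions.

```python
import random
from collections import Counter

def subsample_to_ratio(rows, labels, minority, target_ir, seed=0):
    """Keep all non-minority rows; randomly keep only enough minority rows so that
    (largest other class count) / (minority count) is approximately target_ir.

    Hypothetical sketch: no synthetic oversampling, only minority subsampling.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    counts = Counter(labels)
    largest = max(c for lab, c in counts.items() if lab != minority)
    # Never keep more minority rows than exist, and always keep at least one.
    keep_n = max(1, min(counts[minority], round(largest / target_ir)))
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    kept = set(rng.sample(minority_idx, keep_n))
    idx = [i for i, lab in enumerate(labels) if lab != minority or i in kept]
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy data: 900 majority rows and 100 minority rows (starting ratio 9:1).
rows = list(range(1000))
labels = [0] * 900 + [1] * 100
X_sub, y_sub = subsample_to_ratio(rows, labels, minority=1, target_ir=30)
sub_counts = Counter(y_sub)
```

Repeating this for a sequence of increasing `target_ir` values would yield a robustness curve. Reporting the seed, the ratios, and whether splits were stratified is the kind of detail the comment requests.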

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have made revisions to improve the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (results): the headline claim that foundation models 'combine competitive performance at low computational cost' rests on performance ranks from only MIMIC-IV-ED and eICU. These datasets share US-centric demographics, similar feature schemas, and overlapping imbalance ratios; the manuscript provides no experiments or discussion testing whether the observed ranking survives shifts in missingness patterns, outcome prevalence, or non-US critical-care distributions, which directly undermines the generalization to 'broader clinical settings' asserted in the final paragraph.

    Authors: We agree that the two datasets are similar in key respects and that additional experiments on more diverse distributions would strengthen claims of broader applicability. Our study deliberately focused on these established, publicly available benchmarks to ensure reproducibility and comparability with prior tabular ML work in critical care. In the revised manuscript we have added an explicit limitations discussion in §5 that addresses US-centric demographics, potential differences in missingness and prevalence, and the need for future multi-region validation. We have also softened the language in the abstract and conclusion to frame the efficiency advantage as promising on these representative benchmarks rather than asserting generalization to all clinical settings. revision: partial

  2. Referee: [Abstract and §3] Abstract and §3 (experimental setup): the abstract states that models were evaluated for 'robustness to increasing imbalance' but supplies no description of the exact imbalance-generation procedure (e.g., subsampling ratios, stratification, or whether synthetic minority oversampling was used), nor any statistical testing or error bars on the weighted F1 scores. Without these, the reported performance differences and the claim that TabNet shows the 'steepest performance decline' cannot be verified as load-bearing.

    Authors: The referee is correct that the original description was incomplete. We have revised §3 to provide a full account of the imbalance-generation procedure, including the subsampling method, stratification approach, and explicit statement that no synthetic oversampling was applied. We have also updated §4 to report weighted F1 scores with standard deviations from 5-fold cross-validation and to include statistical significance tests supporting the performance comparisons, including the steeper decline observed for TabNet. revision: yes
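The figure captions describe pairwise Wilcoxon signed-rank tests with Holm correction. The correction step can be sketched as below, assuming the raw p-values have already been obtained (for instance from a routine such as `scipy.stats.wilcoxon`). The p-values shown are hypothetical and the function name `holm_adjust` is an assumption.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values from three pairwise comparisons (NOT from the paper).
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
```

In this toy case only the first comparison remains significant at the 0.05 level after adjustment, although all three raw p-values were below 0.05.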

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark on held-out data

full rationale

The manuscript is an empirical comparison of seven model families across seven tasks on two fixed datasets (MIMIC-IV-ED, eICU). All reported metrics (weighted F1, robustness curves, per-task runtime) are computed directly from standard train/test splits and external evaluation; no equations, fitted parameters, or self-citations are used to derive the central claims. The design contains no self-definitional loops, no predictions that reduce to training inputs by construction, and no load-bearing uniqueness theorems or ansatzes imported from prior author work. The only design choice (TabResNet as lightweight TabNet variant) is presented as an engineering decision, not a mathematical derivation that loops back on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study rests on standard machine-learning evaluation assumptions such as weighted F1 being an appropriate metric for imbalance and the chosen datasets being representative of clinical tabular data.

pith-pipeline@v0.9.0 · 5635 in / 914 out tokens · 31349 ms · 2026-05-16T19:42:31.172015+00:00 · methodology
