Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering

Manar D. Samad; Shrabani Ghosh; Yina Hou

arxiv: 2604.07085 · v2 · pith:NCO5TPD4new · submitted 2026-04-08 · 💻 cs.LG

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords electronic health records · deep clustering · ensemble clustering · heart failure · patient clustering · tabular data · autoencoders · K-means

The pith

An ensemble deep clustering method, combined with traditional clustering, achieves the best overall performance ranking among 14 methods for grouping heart failure patients from electronic health records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how well different clustering methods work on electronic health records to group patients and identify disease subtypes in heart failure cases. Traditional clustering approaches prove more robust on this tabular data than deep learning methods, which were built for images. The authors propose a new ensemble deep clustering technique that combines cluster assignments from several embedding dimensions. When this is merged with traditional clustering in a framework, it ranks best overall among 14 methods tested on real patient data from multiple cohorts. The work also stresses the need for separate analysis by biological sex.

Core claim

The paper establishes that traditional clustering methods perform robustly on tabular EHR data while deep learning approaches underperform due to their design for image clustering. It introduces an ensemble-based deep clustering approach that aggregates cluster assignments from multiple embedding dimensions. When combined with traditional clustering in a novel ensemble framework, this method delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. The findings highlight advantages of combining approaches and the importance of biological sex-specific clustering of EHR data.
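An "overall performance ranking" across methods and cohorts is often computed as a mean rank. The sketch below illustrates that idea only; the source does not describe the paper's ranking procedure, and the method names, cohort names, and scores are made up for illustration.

```python
def rank_descending(scores):
    """Rank methods within one cohort: 1 = best score; ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over all methods tied with the method at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    return ranks


def mean_ranks(scores_by_cohort):
    """Average each method's within-cohort rank over all cohorts."""
    totals = {}
    for scores in scores_by_cohort.values():
        for method, rank in rank_descending(scores).items():
            totals[method] = totals.get(method, 0.0) + rank
    n_cohorts = len(scores_by_cohort)
    return {method: total / n_cohorts for method, total in totals.items()}


# Hypothetical scores (higher is better); NOT results from the paper.
scores_by_cohort = {
    "cohort_a": {"kmeans": 0.40, "deep": 0.30, "ensemble": 0.45},
    "cohort_b": {"kmeans": 0.50, "deep": 0.35, "ensemble": 0.50},
}
overall = mean_ranks(scores_by_cohort)
best = min(overall, key=overall.get)  # lowest mean rank wins
```

A mean rank hides the size of the gaps between methods, which is one reason the referee report asks for the underlying metric values alongside any ranking.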

What carries the argument

Ensemble embedding for deep clustering that aggregates cluster assignments obtained from multiple embedding dimensions rather than a single fixed embedding space, integrated with traditional clustering methods.
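One common way to aggregate cluster assignments from several runs is a co-association consensus: count how often each pair of samples lands in the same cluster, then group pairs that agree in a majority of runs. The sketch below assumes that rule purely for illustration. The source does not state which aggregation rule the paper uses, and the function names, the 0.5 threshold, and the toy labels are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_labels(partitions, threshold=0.5):
    """Link samples co-clustered in more than `threshold` of the partitions,
    then label the connected components of the resulting graph."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over linked samples
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical assignments for six patients from three embedding sizes.
partitions = [
    [0, 0, 0, 1, 1, 1],  # e.g. latent dimension 5
    [1, 1, 1, 0, 0, 0],  # latent dimension 10: same grouping, labels permuted
    [0, 0, 1, 1, 1, 1],  # latent dimension 20: one patient assigned differently
]
consensus = consensus_labels(partitions)
```

Working on pairwise co-membership rather than raw label values sidesteps the fact that cluster IDs are arbitrary and can be permuted between runs.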

Load-bearing premise

Deep learning methods designed for image data inherently underperform on tabular EHR data, and aggregating assignments from multiple embedding dimensions reliably improves clustering quality without overfitting or selection bias.

What would settle it

A direct comparison showing that a single deep embedding space achieves equal or better clustering quality than the ensemble aggregation on the same heart failure EHR cohorts would falsify the advantage of the proposed method.
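Such a comparison could be framed as a paired test over cohorts. The sketch below uses an exact two-sided sign test as one simple option; it is an assumption for illustration, not a procedure taken from the paper, and the score lists are invented.

```python
from math import comb


def sign_test_p_value(ensemble_scores, single_scores):
    """Exact two-sided sign test on paired scores; tied pairs are dropped."""
    wins = sum(e > s for e, s in zip(ensemble_scores, single_scores))
    losses = sum(e < s for e, s in zip(ensemble_scores, single_scores))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # Probability of k or fewer "minority" outcomes under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-cohort quality scores (higher is better); NOT paper results.
ensemble_scores = [0.50, 0.60, 0.70, 0.80, 0.90, 0.55]
single_scores = [0.40, 0.50, 0.60, 0.70, 0.80, 0.50]
p_value = sign_test_p_value(ensemble_scores, single_scores)
```

With only a handful of cohorts the sign test has little power, so a non-significant result would not by itself show that the single embedding is as good as the ensemble.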

Figures

Figures reproduced from arXiv: 2604.07085 by Manar D. Samad, Shrabani Ghosh, Yina Hou.

**Figure 2.** Patient cluster labels color-coded on t-SNE visualization embeddings. t-SNE is applied to G-CEALS with …

**Figure 3.** Comparison of UMAP visualizations for G-CEALS (latent dimension = 10) and K-means on raw data across …

Original abstract

In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The ensemble tweak on deep clustering for EHR data is a reasonable extension, but the top-ranking claim needs the actual numbers and statistics to hold up.

This paper takes known ensemble ideas and applies them to deep clustering on tabular electronic health records for heart failure patients. The new piece is aggregating cluster assignments from multiple embedding dimensions instead of sticking with one, then folding that into a mix with traditional methods like K-means. They test it on All of Us data and say the combo ranks highest among 14 approaches across cohorts.

What works is the practical focus. They explain why image-tuned deep clustering falls short on EHR tables and show that traditional methods hold up well. Pointing out the need for sex-specific analysis is a useful note for anyone doing real clinical data work. The comparison across methods gives a sense of the landscape.

The soft spots are in the evidence. The abstract and summary give no metrics, no error bars, no cohort sizes, and no stats on whether the top ranking is significant. That makes it tough to know if the ensemble really improves things or if it's just variation in the data. The claim that multi-dimension aggregation avoids overfitting or bias is reasonable on paper but needs the methods section and results tables to confirm it doesn't introduce selection issues.

This kind of work is for researchers in medical informatics who cluster patient subgroups and want to see how deep and traditional methods stack up on heart failure records. Someone building tools for subtype discovery could find the comparisons helpful.

I would send it for peer review. The real dataset and the direct comparison make it worth a referee looking at the full results, even if the current write-up needs more detail on the numbers.

Referee Report

2 major / 1 minor

Summary. The paper claims that traditional clustering methods perform robustly on tabular EHR data for heart failure patient cohorts from the All of Us program, while deep learning methods designed for images underperform. It introduces an ensemble deep clustering approach that aggregates cluster assignments from multiple embedding dimensions rather than a single fixed space. When combined with traditional clustering in a novel ensemble framework, this method is asserted to deliver the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts, while also highlighting the importance of biological sex-specific clustering.

Significance. If the empirical ranking holds under rigorous validation, the work could advance healthcare informatics by demonstrating practical benefits of hybrid ensemble strategies for patient subtyping in tabular EHR data, where pure deep clustering has seen limited success. It provides a concrete example of adapting embedding-based methods to non-image domains and emphasizes sex-specific analysis, which may inform more accurate pathophysiology studies and clinical decision support.

major comments (2)

Abstract: The assertion that the proposed ensemble embedding for deep clustering 'delivers the best overall performance ranking' is presented without any quantitative metrics (e.g., ARI, NMI, silhouette scores), statistical tests, error bars, cohort sizes, or implementation details, leaving the central empirical claim unsupported by verifiable evidence.
Introduction and Methods: The foundational assumption that deep learning methods 'are specifically designed for image clustering' and thus inherently limited on tabular EHR data requires explicit ablation studies or direct comparisons to confirm that multi-dimension aggregation improves quality without introducing selection bias or overfitting, as this premise drives the need for the ensemble framework.

minor comments (1)

The 14 clustering methods should be explicitly enumerated in the methods section, and any tables reporting performance rankings should include full metric values and cohort descriptions for reproducibility.
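Of the metrics named in the major comments, the adjusted Rand index (ARI) measures chance-corrected agreement between two partitions, such as a clustering and a set of reference labels. The sketch below is a standard-library implementation of the usual pair-counting formula, offered only to make the metric concrete; whether and how the paper computes ARI is not stated in the source, and the toy label vectors are invented.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected pair-counting agreement between two partitions."""
    n = len(labels_a)
    # Contingency-table cells: samples with label x in A and label y in B.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy partitions of four samples.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is 1 for identical groupings regardless of how cluster IDs are numbered, is near 0 for random agreement, and can be negative when two partitions agree less than chance would predict.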

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: Abstract: The assertion that the proposed ensemble embedding for deep clustering 'delivers the best overall performance ranking' is presented without any quantitative metrics (e.g., ARI, NMI, silhouette scores), statistical tests, error bars, cohort sizes, or implementation details, leaving the central empirical claim unsupported by verifiable evidence.

Authors: We agree that the abstract would be strengthened by including supporting quantitative evidence. In the revised manuscript, we will update the abstract to report key metrics such as the overall performance ranking across the 14 methods, average ARI and NMI values, cohort sizes (number of heart failure patients per All of Us cohort), and references to statistical significance testing. Full details including error bars from repeated runs and implementation specifics remain in the Methods and Results sections.

revision: yes

Referee: Introduction and Methods: The foundational assumption that deep learning methods 'are specifically designed for image clustering' and thus inherently limited on tabular EHR data requires explicit ablation studies or direct comparisons to confirm that multi-dimension aggregation improves quality without introducing selection bias or overfitting, as this premise drives the need for the ensemble framework.

Authors: The manuscript already contains direct empirical comparisons demonstrating that standard deep clustering methods underperform relative to traditional methods on this tabular EHR data. We also report results from the multi-dimension aggregation approach versus single-embedding baselines. To further validate the aggregation step and address concerns about selection bias or overfitting, we will add explicit ablation experiments in the revised version, including performance sensitivity to the number of embedding dimensions and consistency checks across independent cohorts.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper's central claim is an empirical performance ranking of clustering methods (including a proposed ensemble deep clustering approach) on real EHR data from the All of Us program across multiple cohorts and 14 baselines. No derivation chain, theorem, or first-principles result is presented that reduces to its own inputs by construction, self-definition, or fitted-parameter renaming. The abstract and described framework treat the ensemble aggregation as a methodological proposal whose quality is assessed via external data experiments rather than any self-referential equation or self-citation load-bearing premise. This is the expected non-circular outcome for an applied empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard assumptions from clustering literature that performance metrics reflect true subtype structure and that the All of Us dataset is representative; no explicit free parameters or new entities are introduced beyond the ensemble method itself.

axioms (2)

domain assumption: Deep learning clustering methods optimized for images are unsuitable for tabular EHR data without modification.
Directly stated in the abstract as the reason traditional methods perform robustly.

ad hoc to paper: Aggregating cluster assignments from multiple embedding dimensions improves overall clustering quality.
Core premise of the proposed ensemble approach without independent justification in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions... KGG ensemble... best overall performance ranking across 14 diverse clustering methods"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.