Representation learning to advance multi-institutional studies with electronic health record data from US and France
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation
FedEHR-Gen is a federated two-stage autoencoder plus TCVAE system that aligns latent spaces via layer-wise matching and uses distribution-aware aggregation to produce synthetic EHR time-series data matching centralized performance on eICU and MIMIC-III.