pith. sign in

arxiv: 2502.08547 · v2 · submitted 2025-02-12 · 💻 cs.AI

Representation learning to advance multi-institutional studies with electronic health record data from US and France

Pith reviewed 2026-05-23 03:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords electronic health recordsdata harmonizationrepresentation learningmulti-institutional collaborationprivacy-preserving methodsknowledge graphssemantic embedding
0
0 comments X

The pith

A graph-based framework aligns electronic health record vocabularies across institutions by learning a shared semantic space from local statistics, knowledge graphs, and language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a graph-based framework for harmonizing electronic health record data from multiple institutions without sharing patient-level information. It treats data harmonization as a representation learning task that combines institution-specific summary statistics, biomedical knowledge graphs, and semantic embeddings from large language models. The goal is to create a common semantic space that maps different coding practices used at each site. This approach was tested on data from seven institutions in the US and France, supporting the development of clinical models that can be trained and deployed across different healthcare systems and languages.

Core claim

The framework learns a shared semantic space by integrating institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models, thereby aligning diverse site-specific vocabularies while preserving patient privacy.

What carries the argument

A graph-based representation learning framework that jointly embeds institution-specific data summaries, biomedical knowledge graphs, and large language model-derived semantics into a unified space for vocabulary alignment.

If this is right

  • Clinical models can be trained at one institution and deployed at others with aligned data representations.
  • The method supports multi-institutional studies across different countries and languages.
  • Privacy is maintained since only summary statistics are used, not individual patient records.
  • Scalable harmonization is achieved without relying on fixed standards or manual mappings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might extend to other data types like imaging or genomic records if similar summary statistics and knowledge resources are available.
  • Institutions could use the shared space to identify and correct inconsistencies in their own coding practices.
  • Future work could test whether the alignment improves performance in specific clinical prediction tasks like disease diagnosis.

Load-bearing premise

That institution-specific summary statistics, curated biomedical knowledge graphs, and semantic information derived from large language models can be jointly learned into a shared semantic space that aligns diverse site-specific vocabularies.

What would settle it

Demonstrating that models using the learned alignments perform no better than those using random mappings or no alignment on cross-institution tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.08547 by Boris Hejblum, Chuan Hong, Clara-Lea Bonzel, Doudou Zhou, Han Tong, J. Michael Gaziano, Katherine Liao, Kelly Cho, Kenneth Mandl, Kevin Pan, Lauren Costa, Linshanshan Wang, Rodolphe Thiebaut, Romain Griffier, Suqi Liu, Tianrun Cai, Tianxi Cai, Vianney Jouhet, Vidul A. Panickan, Xin Xiong, Yuk-Lam Ho, Yun-Chung Liu, Ziming Gan, Zongqi Xia.

Figure 1
Figure 1. Figure 1: Overview of GAME approach with (a) data source, extraction, processing, and algorithm, and [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The data processing procedure of the GAME algorithm. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Mapping local codes to standard codes using GPT-4. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of key steps in the GAME algorithm: (a) aligning embeddings into a shared repre [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of AUCs for detecting similarity (left) and relatedness (right) relationships using [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Correlation between cosine similarities assigned by GAME and other PLMs compared to [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The C-index between the cosine similarities of the candidate features and the GPT-4 scores [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of hazard ratios (HR) between AD subgroups identified by GAME embedding with [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of hazard ratios (HR) across mental health subgroups identified by GAME em [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
read the original abstract

The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a graph-based framework for data harmonization of electronic health records across privacy-siloed institutions. It treats harmonization as a representation learning task that jointly incorporates institution-specific summary statistics, curated biomedical knowledge graphs, and LLM-derived semantic information to produce a shared semantic space aligning site-specific vocabularies, while preserving patient privacy. The work claims evaluation across seven institutions and two languages as providing a robust foundation for multi-institutional clinical models.

Significance. If the central mechanism can be shown to work, the approach would address a genuine barrier in collaborative EHR research by moving beyond fixed standards or manual mappings. The data-centric framing and use of multiple heterogeneous inputs (summary stats + KGs + LLM semantics) are conceptually aligned with current needs in federated clinical modeling.

major comments (2)
  1. [Abstract] Abstract: the claim that the framework 'was evaluated across seven institutions' is unsupported because the abstract (and the supplied manuscript excerpt) contains no quantitative results, baselines, error metrics, ablation studies, or performance tables.
  2. [Abstract] Abstract: the core modeling claim—that institution-specific summary statistics, biomedical KGs, and LLM semantics can be jointly learned into an aligning shared space—lacks any description of the loss function, architecture, alignment objective (contrastive, reconstruction, graph alignment, etc.), or training procedure, rendering the joint-learning premise an unverified assumption rather than a demonstrated mechanism.
minor comments (1)
  1. [Abstract] The abstract could be revised to separate the high-level motivation from the specific technical contributions and to include at least one key quantitative result if the full manuscript contains it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract requires strengthening to better substantiate its claims and will revise it accordingly while preserving its brevity. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the framework 'was evaluated across seven institutions' is unsupported because the abstract (and the supplied manuscript excerpt) contains no quantitative results, baselines, error metrics, ablation studies, or performance tables.

    Authors: We acknowledge that the current abstract states the evaluation scope without accompanying metrics. The full manuscript reports quantitative results including alignment F1 scores, cosine similarity improvements over baselines (e.g., direct KG matching and LLM-only embeddings), and ablation studies removing each input modality, all computed across the seven institutions. In revision we will insert a concise sentence summarizing key performance metrics and the evaluation scope to make the claim self-contained within the abstract. revision: yes

  2. Referee: [Abstract] Abstract: the core modeling claim—that institution-specific summary statistics, biomedical KGs, and LLM semantics can be jointly learned into an aligning shared space—lacks any description of the loss function, architecture, alignment objective (contrastive, reconstruction, graph alignment, etc.), or training procedure, rendering the joint-learning premise an unverified assumption rather than a demonstrated mechanism.

    Authors: The manuscript body specifies a graph neural network architecture with a composite objective: reconstruction loss on site-specific summary statistics, graph alignment loss on the biomedical KG edges, and contrastive loss aligning LLM-derived embeddings to the shared space, optimized via federated averaging. To address the abstract-level concern we will add a single clause describing the joint objective and alignment mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework relies on external inputs

full rationale

The paper introduces a graph-based representation learning framework that integrates institution-specific summary statistics, curated biomedical knowledge graphs, and LLM-derived semantics to produce a shared semantic space for vocabulary alignment. No equations, loss functions, or derivation steps are shown that reduce any claimed prediction or alignment result to a fitted parameter or input by construction. The approach depends on external resources (KGs, LLMs, site aggregates) rather than self-defining its outputs, and the provided text invokes no self-citations or uniqueness theorems as load-bearing justification. The central claim of cross-institution robustness is presented as an empirical outcome of the joint learning process, not a tautological renaming or definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard properties of representation learning and privacy-preserving aggregation.

pith-pipeline@v0.9.0 · 5780 in / 1004 out tokens · 31922 ms · 2026-05-23T03:15:59.341056+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 1 internal anchor

  1. [1]

    Perspectives and challenges in patient stratification in alzheimer’s disease.Alzheimer’s research & therapy, 14(1):112, 2022

    Carla Abdelnour, Federica Agosta, Marco Bozzali, Bertrand Fougère, Atsushi Iwata, Ramin Nil- forooshan, Leonel T Takada, Félix Viñuela, and Martin Traber. Perspectives and challenges in patient stratification in alzheimer’s disease.Alzheimer’s research & therapy, 14(1):112, 2022

  2. [2]

    Melissa J Armstrong, Shangchen Song, Andrea M Kurasz, and Zhigang Li. Predictors of mortality 7https://docs.smarthealthit.org/ 19 in individuals with dementia in the national alzheimer’s coordinating center.Journal of Alzheimer’s Disease, 86(4):1935–1946, 2022

  3. [3]

    seroquel

    Lisa A Arvanitis and Barbara G Miller. Multiple fixed doses of “seroquel”(quetiapine) in patients with acute exacerbation of schizophrenia: a comparison with haloperidol and placebo.Biological psychiatry, 42(4):233–246, 1997

  4. [4]

    Ehr phenotyping via jointly embedding medical concepts and words into a unified vector space.BMC medical informatics and decision making, 18:15–25, 2018

    Tian Bai, Ashis Kumar Chanda, Brian L Egleston, and Slobodan Vucetic. Ehr phenotyping via jointly embedding medical concepts and words into a unified vector space.BMC medical informatics and decision making, 18:15–25, 2018

  5. [5]

    Tucker: Tensor factorization for knowledge graph completion

    Ivana Balažević, Carl Allen, and Timothy M Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019

  6. [6]

    Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan Palmer, Xu Shi, Tianxi Cai, and Isaac S

    Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan Palmer, Xu Shi, Tianxi Cai, and Isaac S. Kohane. Clinical concept embeddings learned from massive sources of multimodal medical data. In Biocomputing 2020. WORLD SCIENTIFIC, Nov 2019. doi: 10. 1142/9789811215636_0027. URL https://doi.org/10.1142%2F9789811215636_0027

  7. [7]

    A neural probabilistic language model

    Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. JMLR, 3:1137–1155, 2003

  8. [8]

    Bodenreider

    O. Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic Acids Research, page 267D – 270, Jan 2004. doi: 10.1093/nar/gkh061. URLhttp://dx. doi.org/10.1093/nar/gkh061

  9. [9]

    Evaluation of the ccam hierarchy and semi structured code for retrieving relevant procedures in a hospital case mix database

    Cédric Bousquet, Béatrice Trombert, Julien Souvignet, Eric Sadou, and Jean-Marie Rodrigues. Evaluation of the ccam hierarchy and semi structured code for retrieving relevant procedures in a hospital case mix database. In AMIA Annual Symposium Proceedings, volume 2010, page 61. American Medical Informatics Association, 2010

  10. [10]

    International statistical classification of diseases and related health problems

    Gerlind R Brämer. International statistical classification of diseases and related health problems. tenth revision.World health statistics quarterly. Rapport trimestriel de statistiques sanitaires mon- diales, 41(1):32–36, 1988

  11. [11]

    Consensus knowledge graph learning via multi-view sparse low rank block model.arXiv preprint arXiv:2209.13762, 2022

    Tianxi Cai, Dong Xia, Luwan Zhang, and Doudou Zhou. Consensus knowledge graph learning via multi-view sparse low rank block model.arXiv preprint arXiv:2209.13762, 2022

  12. [12]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distil- lation. arXiv preprint arXiv:2402.03216, 2024

  13. [13]

    Privacy protec- tion and intrusion avoidance for cloudlet-based medical data sharing.IEEE transactions on Cloud computing, 8(4):1274–1283, 2016

    Min Chen, Yongfeng Qian, Jing Chen, Kai Hwang, Shiwen Mao, and Long Hu. Privacy protec- tion and intrusion avoidance for cloudlet-based medical data sharing.IEEE transactions on Cloud computing, 8(4):1274–1283, 2016

  14. [14]

    Multi-layer representation learning for medi- cal concepts

    Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medi- cal concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504, 2016. 20

  15. [15]

    Comparative efficacy and acceptability of antimanic drugs in acute mania: a multiple-treatments meta-analysis

    Andrea Cipriani, Corrado Barbui, Georgia Salanti, Jennifer Rendell, Rachel Brown, Sarah Stockton, Marianna Purgato, Loukia M Spineli, Guy M Goodwin, and John R Geddes. Comparative efficacy and acceptability of antimanic drugs in acute mania: a multiple-treatments meta-analysis. The Lancet, 378(9799):1306–1315, 2011

  16. [16]

    Comorbidity clusters in autism spectrum dis- orders: an electronic health record time-series analysis.Pediatrics, 133(1):e54–e63, 2014

    Finale Doshi-Velez, Yaorong Ge, and Isaac Kohane. Comorbidity clusters in autism spectrum dis- orders: an electronic health record time-series analysis.Pediatrics, 133(1):e54–e63, 2014

  17. [17]

    Risk factors for suicide in adults: systematic review and meta-analysis of psychological autopsy studies.BMJ Ment Health, 25(4):148–155, 2022

    Louis Favril, Rongqin Yu, Abdo Uyar, Michael Sharpe, and Seena Fazel. Risk factors for suicide in adults: systematic review and meta-analysis of psychological autopsy studies.BMJ Ment Health, 25(4):148–155, 2022

  18. [18]

    Seena Fazel and Bo Runeson. Suicide. New England Journal of Medicine, 382(3):266–274, 2020. doi: 10.1056/NEJMra1902944

  19. [19]

    Gnaeus: Utilizing clinical guidelines for knowledge-assisted visualisation of ehr cohorts

    Paolo Federico, Jürgen Unger, Albert Amor-Amorós, Lucia Sacchi, Denis Klimov, and Silvia Miksch. Gnaeus: Utilizing clinical guidelines for knowledge-assisted visualisation of ehr cohorts. InEuroVA@ EuroVis, pages 79–83, 2015

  20. [20]

    The benefit of augmenting open data with clinical data-warehouse ehr for forecasting sars-cov-2 hospi- talizations in bordeaux area, france.JAMIA open, 5(4):ooac086, 2022

    Thomas Ferté, Vianney Jouhet, Romain Griffier, Boris P Hejblum, and Rodolphe Thiébaut. The benefit of augmenting open data with clinical data-warehouse ehr for forecasting sars-cov-2 hospi- talizations in bordeaux area, france.JAMIA open, 5(4):ooac086, 2022

  21. [21]

    ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.medRxiv, 2023

    Ziming Gan, Doudou Zhou, Everett Rush, Vidul A Panickan, Yuk-Lam Ho, George Ostrouchov, Zhiwei Xu, Shuting Shen, Xin Xiong, Kimberly F Greco, et al. ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.medRxiv, 2023

  22. [22]

    A new model for learning in graph domains

    Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. InProceedings. 2005 IEEE international joint conference on neural networks, 2005., volume 2, pages 729–734. IEEE, 2005

  23. [23]

    Domain-specific language model pretraining for biomedical natural language processing.ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing.ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021

  24. [24]

    An open-source framework for end-to-end analysis of electronic health record data.Nature medicine, 30(11):3369– 3380, 2024

    Lukas Heumos, Philipp Ehmele, Tim Treis, Julius Upmeier zu Belzen, Eljas Roellin, Lilly May, Altana Namsaraeva, Nastassya Horlava, Vladimir A Shitov, Xinyue Zhang, et al. An open-source framework for end-to-end analysis of electronic health record data.Nature medicine, 30(11):3369– 3380, 2024

  25. [25]

    Clinical knowledge extraction via sparse embedding regression (keser) with multi-center large scale electronic health record data

    Chuan Hong, Everett Rush, Molei Liu, Doudou Zhou, Jiehuan Sun, Aaron Sonabend, Victor M Castro, Petra Schubert, Vidul A Panickan, Tianrun Cai, et al. Clinical knowledge extraction via sparse embedding regression (keser) with multi-center large scale electronic health record data. medRxiv, 2021

  26. [26]

    Psychosis in alzheimer disease—mechanisms, genetics and therapeutic opportunities

    Zahinoor Ismail, Byron Creese, Dag Aarsland, Helen C Kales, Constantine G Lyketsos, Robert A Sweet, and Clive Ballard. Psychosis in alzheimer disease—mechanisms, genetics and therapeutic opportunities. Nature Reviews Neurology, 18(3):131–144, 2022

  27. [27]

    MIMIC-IV (version 0.4)

    A Johnson, L Bulgarelli, T Pollard, S Horng, L A Celi, and R Mark. MIMIC-IV (version 0.4). PhysioNet., 2020. 21

  28. [28]

    Code2vec: Embed- ding and clustering medical diagnosis data

    David Kartchner, Tanner Christensen, Jeffrey Humpherys, and Sean Wade. Code2vec: Embed- ding and clustering medical diagnosis data. In2017 IEEE International Conference on Healthcare Informatics, pages 386–390, 2017

  29. [29]

    Deep representation learning of electronic health records to unlock patient stratification at scale.NPJ digital medicine, 3(1):96, 2020

    Isotta Landi, Benjamin S Glicksberg, Hao-Chih Lee, Sarah Cherng, Giulia Landi, Matteo Danieletto, Joel T Dudley, Cesare Furlanello, and Riccardo Miotto. Deep representation learning of electronic health records to unlock patient stratification at scale.NPJ digital medicine, 3(1):96, 2020

  30. [30]

    Lozano, A

    Dongha Lee, Xiaoqian Jiang, and Hwanjo Yu. Harmonized representation learning on dynamic ehr graphs. Journal of biomedical informatics, 106:103426, June 2020. ISSN 1532-0464. doi: 10.1016/j. jbi.2020.103426. URL https://doi.org/10.1016/j.jbi.2020.103426

  31. [31]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020

  32. [32]

    Neural word embedding as implicit matrix factorization.Advances in neural information processing systems, 27, 2014

    Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization.Advances in neural information processing systems, 27, 2014

  33. [33]

    Identification of type 2 diabetes subgroups through topological analysis of patient similarity.Science translational medicine, 7(311):311ra174–311ra174, 2015

    Li Li, Wei-Yi Cheng, Benjamin S Glicksberg, Omri Gottesman, Ronald Tamler, Rong Chen, Erwin P Bottinger, and Joel T Dudley. Identification of type 2 diabetes subgroups through topological analysis of patient similarity.Science translational medicine, 7(311):311ra174–311ra174, 2015

  34. [34]

    Development of phenotype algorithms using electronic medical records and incorporating natural language processing

    Katherine P Liao, Tianxi Cai, Guergana K Savova, Shawn N Murphy, Elizabeth W Karlson, Ash- win N Ananthakrishnan, Vivian S Gainer, Stanley Y Shaw, Zongqi Xia, Peter Szolovits, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. bmj, 350, 2015

  35. [35]

    Multimodal learning on graphs for disease relation extraction

    Yucong Lin, Keming Lu, Sheng Yu, Tianxi Cai, and Marinka Zitnik. Multimodal learning on graphs for disease relation extraction. CoRR, abs/2203.08893, 2022. doi: 10.48550/ARXIV.2203.08893. URL https://doi.org/10.48550/arXiv.2203.08893

  36. [36]

    Self-alignment pretraining for biomedical entity representations

    Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North...

  37. [37]

    URL https://aclanthology.org/2021.naacl-main.334

  38. [38]

    The role of nmda receptors in alzheimer’s disease.Frontiers in neuroscience, 13:43, 2019

    Jinping Liu, Lirong Chang, Yizhi Song, Hui Li, and Yan Wu. The role of nmda receptors in alzheimer’s disease.Frontiers in neuroscience, 13:43, 2019

  39. [39]

    Loinc, a universal standard for identifying laboratory observations: a 5-year update.Clinical chemistry, 49(4):624–633, 2003

    Clement J McDonald, Stanley M Huff, Jeffrey G Suico, Gilbert Hill, Dennis Leavelle, Raymond Aller, Arden Forrey, Kathy Mercer, Georges DeMoor, John Hook, et al. Loinc, a universal standard for identifying laboratory observations: a 5-year update.Clinical chemistry, 49(4):624–633, 2003

  40. [40]

    Distributed representa- tions of words and phrases and their compositionality.Adv Neural Inf Process Syst, 26:3111–3119, 2013

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representa- tions of words and phrases and their compositionality.Adv Neural Inf Process Syst, 26:3111–3119, 2013. 22

  41. [41]

    Federated learning for heterogeneous electronic health records utilising augmented temporal graph attention networks

    Soheila Molaei, Anshul Thakur, Ghazaleh Niknam, Andrew Soltan, Hadi Zare, and David A Clifton. Federated learning for heterogeneous electronic health records utilising augmented temporal graph attention networks. InInternational Conference on Artificial Intelligence and Statistics, pages 1342–

  42. [42]

    Omop, 2021

    OMOP. Omop, 2021. URLhttps://ohdsi.org/omop/. Accessed: June, 2021

  43. [43]

    International classification of diseases—ninth revision (icd-9)

    World Health Organization et al. International classification of diseases—ninth revision (icd-9). Weekly Epidemiological Record= Relevé épidémiologique hebdomadaire, 63(45):343–344, 1988

  44. [44]

    Glove: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014

  45. [45]

    Reina, Jason Martin, Sarthak Pati, Aikaterini Kotrotsou, Mikhail Milchenko, Weilin Xu, Daniel Marcus, Rivka Colen, and Spyridon Bakas

    Micah Sheller, Brandon Edwards, G. Reina, Jason Martin, Sarthak Pati, Aikaterini Kotrotsou, Mikhail Milchenko, Weilin Xu, Daniel Marcus, Rivka Colen, and Spyridon Bakas. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data.Scientific Reports, 10, 07 2020. doi: 10.1038/s41598-020-69250-1

  46. [46]

    Biomegatron: larger biomedical domain language model

    Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. Biomegatron: larger biomedical domain language model. In Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4700–4706, 2020

  47. [47]

    Suicide and prevalence of mental disorders: A systematic review and meta-analysis of world data on case-control psychological autopsy studies

    Roshan Sutar, Akash Kumar, and Vikas Yadav. Suicide and prevalence of mental disorders: A systematic review and meta-analysis of world data on case-control psychological autopsy studies. Psychiatry research, page 115492, 2023

  48. [48]

    Federated k-means clustering.arXiv preprint arXiv:2310.01195, 2024

    Marcel Reinders Swier Garst. Federated k-means clustering.arXiv preprint arXiv:2310.01195, 2024

  49. [49]

    Tariot, Martin R

    Pierre N. Tariot, Martin R. Farlow, George T. Grossberg, Stephen M. Graham, Scott McDonald, Ivan Gergel, and for the Memantine Study Group. Memantine treatment in patients with moderate to severe alzheimer disease already receiving donepezila randomized controlled trial.JAMA, 291(3): 317–324, 01 2004. ISSN 0098-7484. doi: 10.1001/jama.291.3.317

  50. [50]

    Graph attention networks

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. InInternational Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ

  51. [51]

    Risk and Trust Perceptions of the Public of Artifical Intelligence Applications

    Ke Wang, Ning Chen, and Ting Chen. Joint medical ontology representation learning for healthcare predictions. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2020. doi: 10.1109/IJCNN48605.2020.9207355

  52. [52]

    Stratification of alzheimer’s disease patients using knowledge-guided unsupervised latent factor clustering with electronic health record data

    Linshanshan Wang, Shruthi Venkatesh, Michele Morris, Mengyan Li, Ratnam Srivastava, Shyam Visweswaran, Oscar Lopez, Zongqi Xia, and Tianxi Cai. Stratification of alzheimer’s disease patients using knowledge-guided unsupervised latent factor clustering with electronic health record data. medRxiv, 2024. doi: 10.1101/2024.12.23.24319588. URL https://www.medr...

  53. [53]

    Multi-similarity loss with general pair weighting for deep metric learning

    Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5022–5030, 2019. 23

  54. [54]

    Knowledge graph embedding by trans- lating on hyperplanes

    Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by trans- lating on hyperplanes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014

  55. [55]

    Mandl, Suchun Cheng, Zongqi Xia, Kelly Cho, J

    Xin Xiong, Sara Morini Sweet, Molei Liu, Chuan Hong, Clara-Lea Bonzel, Vidul Ayakulangara Panickan, Doudou Zhou, Linshanshan Wang, Lauren Costa, Yuk-Lam Ho, Alon Geva, Kenneth D. Mandl, Suchun Cheng, Zongqi Xia, Kelly Cho, J. Michael Gaziano, Katherine P. Liao, Tianxi Cai, and Tianrun Cai. Knowledge-driven online multimodal automated phenotyping system.medRxiv,

  56. [56]

    URL https://www.medrxiv.org/content/early/2023/ 10/02/2023.09.29.23296239

    doi: 10.1101/2023.09.29.23296239. URL https://www.medrxiv.org/content/early/2023/ 10/02/2023.09.29.23296239

  57. [57]

    Kg-bert: Bert for knowledge graph completion.arXiv preprint arXiv:1909.03193, 2019

    Liang Yao, Chengsheng Mao, and Yuan Luo. Kg-bert: Bert for knowledge graph completion.arXiv preprint arXiv:1909.03193, 2019

  58. [58]

    Coder: Knowledge- infused cross-lingual medical term embedding for term normalization.Journal of Biomedical Infor- matics, 126:103983, 2022

    Zheng Yuan, Zhengyun Zhao, Haixia Sun, Jiao Li, Fei Wang, and Sheng Yu. Coder: Knowledge- infused cross-lingual medical term embedding for term normalization.Journal of Biomedical Infor- matics, 126:103983, 2022

  59. [59]

    Predictors for survival in patients with alzheimer’s disease: a large comprehensive meta-analysis.Translational Psychiatry, 14(1):184, 2024

    Xiaoting Zheng, Shichan Wang, Jingxuan Huang, Chunyu Li, and Huifang Shang. Predictors for survival in patients with alzheimer’s disease: a large comprehensive meta-analysis.Translational Psychiatry, 14(1):184, 2024

  60. [60]

    Panickan, Chuan Hong, Yuk-Lam Ho, Tianrun Cai, Lauren Costa, Xiaoou Li, Victor M

    Doudou Zhou, Ziming Gan, Xu Shi, Alina Patwari, Everett Rush, Clara-Lea Bonzel, Vidul A. Panickan, Chuan Hong, Yuk-Lam Ho, Tianrun Cai, Lauren Costa, Xiaoou Li, Victor M. Castro, Shawn N. Murphy, Gabriel Brat, Griffin Weber, Paul Avillach, J. Michael Gaziano, Kelly Cho, Katherine P. Liao, Junwei Lu, and Tianxi Cai. Multiview incomplete knowledge graph int...

  61. [61]

    grandparent

    Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, and Tianxi Cai. Federated offline reinforcement learning. Journal of the American Statistical Association, pages 1–12, 2024. 24 Supplementary Material Representation Learning to Advance Multi-Institutional Studies with Electronic Health Record Data S.1 Training and validation data base ...

  62. [62]

    one-step training

    In the similarity training step, we save the embedding with the highest code mapping accuracy, as detailed in Algorithm 2. In the relatedness training step, we save the embedding with the highest feature selection correlation, also detailed in Algorithm 2. When splitting the training and validation sets, we divide the similar hierarchical pairs according ...