A Multi-Dimensional Clustering Approach for Identifying Inborn Errors of Immunity
Pith reviewed 2026-05-20 16:14 UTC · model grok-4.3
The pith
A pipeline converts raw immunologic lab data into vectors for clustering to identify inborn errors of immunity patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed pipeline transforms raw immunologic lab data into vectors, then applies hyperparameter-tuned clustering to those vectors to recognize novel rare-disease patterns and extract IEI-associated features from a national data registry.
What carries the argument
Multi-dimensional clustering on vectorized immunologic lab data from EHR with hyperparameter tuning for pattern recognition.
If this is right
- Refines IEI feature awareness from registry data.
- Develops data tool kits for rare disease population analysis.
- Expands methods for transforming complex medical records into structures for unsupervised ML.
- Recognizes novel rare disease patterns beyond known cases.
Where Pith is reading between the lines
- The same vectorization and clustering could be applied to lab data for other rare genetic disorders.
- Linking clusters to genetic records might reveal new IEI subtypes.
- Testing on additional national registries would check pattern stability across sites.
Load-bearing premise
Raw EHR immunologic lab data from the national registry can be reliably curated and formatted into vectors that preserve clinically meaningful signals without bias or loss of information.
What would settle it
If the pipeline were run on a held-out set of confirmed IEI cases and controls and either produced no distinct clusters or failed to extract known IEI-associated features, the claim would be falsified.
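One way to make "no distinct clusters" measurable is the silhouette coefficient, which compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below is a minimal pure-Python version under stated assumptions: the points, both labelings, and every name in it are hypothetical toy values, not data or code from the paper, and Euclidean distance is an assumed choice.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 2-D lab vectors: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated = [0, 0, 0, 1, 1, 1]   # labeling that follows the two groups
mixed = [0, 1, 0, 1, 0, 1]       # labeling that ignores them

results = {
    "separated": silhouette_score(points, separated),
    "mixed": silhouette_score(points, mixed),
}
```

A score near 1 indicates compact, well-separated clusters, while a score near or below 0 indicates overlap. What threshold would count as "no distinct clusters" on real registry data is a judgment the source does not specify.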
Original abstract
Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end-organ damage and improve quality of life. Hurdles in accessing and curating large-scale electronic health record (EHR) data limit routine data-driven analyses to remain on the forefront of IEI and other rare disease trends. Development of machine learning (ML) algorithms in IEI for pattern recognition as well as published methodology examining how to systematically process and integrate complex medical data is limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents the pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data tool kits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a pipeline for curating and processing electronic health record (EHR) immunologic lab data from a national registry, transforming raw data into vectors, and applying hyperparameter-tuned multi-dimensional clustering to identify novel patterns and extract features associated with inborn errors of immunity (IEI).
Significance. If the vectorization step reliably preserves clinically meaningful signals from sparse, inconsistent EHR data and the clustering yields biologically interpretable groups, the work could advance data-driven methods for early diagnosis of rare diseases like IEI and provide reusable toolkits for similar analyses in other rare-disease populations.
major comments (2)
- [Methods / Pipeline description] The abstract and methods description of the vectorization step (aggregation, imputation, normalization, or encoding of raw immunologic lab data) does not address how sparsity, inconsistent units/reference ranges, multiple tests per patient, or high missingness are handled. Without explicit handling, clusters may reflect data artifacts rather than IEI biology, directly undermining the central claim that the pipeline recognizes novel rare-disease patterns.
- [Results / Evaluation] No performance metrics, validation results, error analysis, cluster quality measures (e.g., silhouette score), or comparison against known IEI cases are reported. The abstract outlines the pipeline components but supplies none of these, leaving the claim of effective unsupervised clustering without demonstrated support.
minor comments (1)
- [Methods] Clarify the exact dimensionality and feature construction of the vectors produced from lab data.
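The vectorization concerns raised in this report (multiple tests per patient, high missingness, normalization) can be made concrete with a small sketch. It uses median aggregation and an 80% missingness cutoff, the choices the authors' rebuttal describes, but it is not the authors' code. The record layout, test names, values, and function names are all hypothetical. A plain cohort z-score stands in for the age- and sex-adjusted normalization, and missing entries are left as `None` rather than imputed with MICE.

```python
from statistics import mean, median, pstdev

# Hypothetical long-format records: (patient_id, test_name, value).
records = [
    ("p1", "IgG", 700.0), ("p1", "IgG", 900.0), ("p1", "IgA", 80.0),
    ("p2", "IgG", 300.0), ("p2", "IgA", 20.0),
    ("p3", "IgG", 1100.0), ("p3", "IgA", 140.0), ("p3", "IgE", 50.0),
    ("p4", "IgG", 500.0),
    ("p5", "IgG", 900.0), ("p5", "IgA", 110.0),
    ("p6", "IgG", 1000.0),
]

def aggregate_median(records):
    """Collapse repeated tests to one median value per (patient, test)."""
    grouped = {}
    for patient, test, value in records:
        grouped.setdefault((patient, test), []).append(value)
    return {key: median(vals) for key, vals in grouped.items()}

def build_vectors(records, max_missing=0.8):
    """Return (feature_names, {patient: [z-score or None, ...]})."""
    agg = aggregate_median(records)
    patients = sorted({p for p, _ in agg})
    tests = sorted({t for _, t in agg})
    # Keep only features whose missing fraction does not exceed the cutoff.
    kept = [
        t for t in tests
        if sum((p, t) not in agg for p in patients) / len(patients) <= max_missing
    ]
    vectors = {p: [] for p in patients}
    for t in kept:
        observed = [agg[(p, t)] for p in patients if (p, t) in agg]
        mu, sigma = mean(observed), pstdev(observed)
        for p in patients:
            if (p, t) not in agg:
                vectors[p].append(None)  # left for a separate imputation step
            else:
                z = (agg[(p, t)] - mu) / sigma if sigma > 0 else 0.0
                vectors[p].append(z)
    return kept, vectors

features, vectors = build_vectors(records)
```

In this toy cohort, `IgE` is observed for one of six patients, so it exceeds the 80% cutoff and is dropped. The vector dimensionality then equals the number of surviving features, which is the quantity the minor comment asks the authors to state.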
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our manuscript on a pipeline for curating EHR immunologic data and applying multi-dimensional clustering to identify patterns in inborn errors of immunity. We address each major comment below and have incorporated revisions to improve the description of the vectorization process and to add quantitative evaluation metrics.
- Referee: [Methods / Pipeline description] The abstract and methods description of the vectorization step (aggregation, imputation, normalization, or encoding of raw immunologic lab data) does not address how sparsity, inconsistent units/reference ranges, multiple tests per patient, or high missingness are handled. Without explicit handling, clusters may reflect data artifacts rather than IEI biology, directly undermining the central claim that the pipeline recognizes novel rare-disease patterns.
Authors: We agree that the abstract and high-level methods overview did not sufficiently detail the handling of these data quality issues, which is critical for interpreting the clusters. The full manuscript contains a data preprocessing subsection, but we have now expanded it in the revision to explicitly describe: aggregation of multiple tests per patient via median values; z-score normalization adjusted for age- and sex-specific reference ranges to address inconsistent units; multiple imputation by chained equations (MICE) for missing data; and exclusion of features exceeding 80% missingness to reduce sparsity effects. These choices aim to retain biologically relevant signals while minimizing artifact-driven clustering.
Revision: yes
- Referee: [Results / Evaluation] No performance metrics, validation results, error analysis, cluster quality measures (e.g., silhouette score), or comparison against known IEI cases are reported. The abstract outlines the pipeline components but supplies none of these, leaving the claim of effective unsupervised clustering without demonstrated support.
Authors: We acknowledge that the original manuscript prioritized pipeline description and qualitative cluster interpretation over formal quantitative validation. In the revised version, we have added a dedicated evaluation subsection reporting silhouette scores and Davies-Bouldin indices across hyperparameter configurations, along with enrichment analysis comparing cluster membership to known IEI cases in the registry. We also include a brief error analysis addressing potential biases such as variable lab testing frequency. These additions provide measurable support for the clustering results while noting the inherent challenges of external validation in rare-disease settings.
Revision: yes
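The enrichment analysis mentioned in the second response is not specified further. One common way to test whether known IEI cases are over-represented in a cluster is a one-sided hypergeometric test, sketched here as an assumption rather than as the authors' method. The function name and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_tail(k, cluster_size, cases_total, population):
    """P(X >= k) when cluster_size patients are drawn without replacement
    from a population that contains cases_total known cases."""
    if not 0 <= k <= cluster_size <= population:
        raise ValueError("need 0 <= k <= cluster_size <= population")
    if not 0 <= cases_total <= population:
        raise ValueError("need 0 <= cases_total <= population")
    upper = min(cluster_size, cases_total)
    total = comb(population, cluster_size)
    favourable = sum(
        comb(cases_total, x) * comb(population - cases_total, cluster_size - x)
        for x in range(k, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 100 patients, 10 known IEI cases,
# and a cluster of 10 patients that contains 6 of those cases.
p_value = hypergeom_tail(k=6, cluster_size=10, cases_total=10, population=100)
```

A small tail probability suggests the cluster captures known cases more often than chance would. When many clusters and hyperparameter configurations are tested, these p-values would need a multiple-comparison correction.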
Circularity Check
No significant circularity in proposed EHR-to-vector clustering pipeline
The manuscript describes a forward methodological pipeline: raw immunologic lab data from a national registry is curated and formatted into vectors, followed by hyperparameter-tuned unsupervised clustering to identify IEI patterns and features. No equations, predictions, or uniqueness claims are presented that reduce by construction to the input data or to self-citations. The central claim (that the pipeline enables pattern recognition) remains independent of the vectorization step; any success or failure is an empirical question about data fidelity rather than a definitional or fitted tautology. This is a standard applied ML methods paper with no load-bearing self-referential steps.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Our methodology for EHR data formatting and processing presents the pipeline that transforms raw immunologic lab data into vectors... combined with hyperparameter tuning for disease pattern recognition via clustering."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We represent every data point as a vector consisting of five measurements... $F_{\mathrm{Norm}} = (\mathrm{LabValAbs} - \mathrm{ref}_{\mathrm{low}}) / (\mathrm{ref}_{\mathrm{high}} - \mathrm{ref}_{\mathrm{low}})$"
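The normalization quoted in the last passage can be read as scaling each absolute lab value against its reference interval, so that the lower bound maps to 0 and the upper bound to 1. That reading assumes the numerator is grouped as (value minus lower bound), which the extracted text does not make explicit. A minimal sketch with hypothetical reference bounds:

```python
def range_normalize(lab_val_abs, ref_low, ref_high):
    """Scale a lab value so ref_low maps to 0.0 and ref_high maps to 1.0.

    Values below the reference interval come out negative and values
    above it come out greater than 1, so out-of-range results stay visible.
    """
    if ref_high <= ref_low:
        raise ValueError("ref_high must be greater than ref_low")
    return (lab_val_abs - ref_low) / (ref_high - ref_low)

# Hypothetical reference interval of 600 to 1600 (units arbitrary).
in_range = range_normalize(700.0, 600.0, 1600.0)      # just above the lower bound
below_range = range_normalize(300.0, 600.0, 1600.0)   # negative: below the interval
above_range = range_normalize(2000.0, 600.0, 1600.0)  # > 1: above the interval
```

Because the bounds are passed per value, the same function works when reference intervals differ by age or sex, provided the matching interval is looked up for each patient.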
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Martinson AK, Chin AT, Butte MJ, Rider NL. Artificial Intelligence and Machine Learning for Inborn Errors of Immunity: Current State and Future Promise. J Allergy Clin Immunol Pract. 2024;12(10):2695-2704. doi:10.1016/j.jaip.2024.08.012
- [2] Ehwerhemuepha L, Carlson K, Moog R, et al. Cerner real-world data (CRWD) – A de-identified multicenter electronic health records database. Data Brief. 2022;42:108120. doi:10.1016/j.dib.2022.108120
- [3] Modell V. The impact of physician education and public awareness on early diagnosis of primary immunodeficiencies: Robert A. Good Immunology Symposium. Immunol Res. 2007;38(1-3):43-47. doi:10.1007/s12026-007-0048-5
- [4] Khoury P, Srinivasan R, Kakumanu S, et al. A Framework for Augmented Intelligence in Allergy and Immunology Practice and Research—A Work Group Report of the AAAAI Health Informatics, Technology, and Education Committee. J Allergy Clin Immunol Pract. 2022;10(5):1178-1188. doi:10.1016/j.jaip.2022.01.047
- [5] Herzog NJ, Magoulas GD. Brain Asymmetry Detection and Machine Learning Classification for Diagnosis of Early Dementia. Sensors. 2021;21(3):778. doi:10.3390/s21030778
- [6] Loncaric F, Marti Castellote PM, Sanchez-Martinez S, et al. Automated Pattern Recognition in Whole-Cardiac Cycle Echocardiographic Data: Capturing Functional Phenotypes with Machine Learning. J Am Soc Echocardiogr. 2021;34(11):1170-1183. doi:10.1016/j.echo.2021.06.014
- [7] Sager S, Bernhardt F, Kehrle F, et al. Expert-enhanced machine learning for cardiac arrhythmia classification. PLOS ONE. 2021;16(12):e0261571. doi:10.1371/journal.pone.0261571
- [8] Emmaneel A, Bogaert DJ, Van Gassen S, et al. A Computational Pipeline for the Diagnosis of CVID Patients. Front Immunol. 2019;10:2009. doi:10.3389/fimmu.2019.02009
- [9] Guevara-Barrientos D, Kaundal R. ProFeatX: A parallelized protein feature extraction suite for machine learning. Comput Struct Biotechnol J. 2023;21:796-801. doi:10.1016/j.csbj.2022.12.044
- [10] Rider NL, Miao D, Dodds M, et al. Calculation of a Primary Immunodeficiency “Risk Vital Sign” via Population-Wide Analysis of Claims Data to Aid in Clinical Decision Support. Front Pediatr. 2019;7:70. doi:10.3389/fped.2019.00070
- [11] Rider NL, Cahill G, Motazedi T, et al. PI Prob: A risk prediction and clinical guidance system for evaluating patients with recurrent infections. PLOS ONE. 2021;16(2):e0237285. doi:10.1371/journal.pone.0237285
- [12] Rider NL, Coffey M, Kurian A, et al. A validated artificial intelligence-based pipeline for population-wide primary immunodeficiency screening. J Allergy Clin Immunol. 2023;151(1):272-279. doi:10.1016/j.jaci.2022.10.005
- [13] Mayampurath A, Ajith A, Anderson-Smits C, et al. Early Diagnosis of Primary Immunodeficiency Disease Using Clinical Data and Machine Learning. J Allergy Clin Immunol Pract. 2022;10(11):3002-3007.e5. doi:10.1016/j.jaip.2022.08.041
- [14] Trulson I, Holdenrieder S, Hoffmann G. Using machine learning techniques for exploration and classification of laboratory data. Journal of Laboratory Medicine. 2024;48(5):203-214. doi:10.1515/labmed-2024-0100
- [15] Nemati S, Mohammadi B, Hooshanginezhad Z. Clustering Based on Laboratory Data in Patients With Heart Failure Admitted to the Intensive Care Unit. Journal of Clinical Laboratory Analysis. 2024;38(21):e25109. doi:10.1002/jcla.25109
- [16] Shearer WT, Rosenblatt HM, Gelman RS, et al. Lymphocyte subsets in healthy children from birth through 18 years of age: the Pediatric AIDS Clinical Trials Group P1009 study. J Allergy Clin Immunol. 2003;112(5):973-980. doi:10.1016/j.jaci.2003.07.003
- [17] Comans-Bitter WM, de Groot R, van den Beemd R, et al. Immunophenotyping of blood lymphocytes in childhood: Reference values for lymphocyte subpopulations. J Pediatr. 1997;130(3):388-393. doi:10.1016/S0022-3476(97)70200-2
- [18] Sullivan KE, Stiehm ER, eds. Stiehm’s Immune Deficiencies. 1st ed. Academic Press; 2014. ISBN:978-0-12-405546-9
- [19] Zhang Z. Multiple imputation with multivariate imputation by chained equation (MICE) package. Ann Transl Med. 2016;4(2):30. doi:10.3978/j.issn.2305-5839.2015.12.63