arxiv: 2604.22890 · v1 · submitted 2026-04-24 · 🧬 q-bio.OT

AI-Derived Reproductive Phenotypes and Explainable ML for Concurrent Early Multimorbidity in U.S. Women: NHANES 2017-March 2020

Sunday A. Adetunji

Pith reviewed 2026-05-08 08:38 UTC · model grok-4.3

keywords: reproductive history, multimorbidity, NHANES, machine learning, phenotyping, explainable AI, women's health, early onset chronic disease

The pith

Adverse reproductive life-course patterns strongly cluster with concurrent early multimorbidity in U.S. women aged 20-44.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses national survey data to group women by reproductive history and chronic health markers, then tests whether certain groups show much higher rates of multiple early-onset conditions. It finds four distinct phenotypes, one of which contains women with heavy adverse reproductive burdens and shows 77.5 percent multimorbidity. The work compares logistic regression and XGBoost models, stresses calibration and explainability via SHAP values, and argues that reproductive history can serve as a practical signal for concurrent risk rather than isolated outcomes. A sympathetic reader would care because early identification of such clusters could shift screening and prevention toward younger women before chronic diseases fully develop.

Core claim

Principal components analysis and k-means phenotyping revealed that adverse reproductive life-course structure is strongly clustered with concurrent early multimorbidity in U.S. women aged 20-44 years. Although XGBoost improved discrimination, calibration and feature attribution remained essential for reliable translation into practice.

What carries the argument

Principal components analysis to reduce reproductive-history and multimorbidity features, followed by k-means clustering into four phenotypes, with SHAP values to explain contributions in logistic regression and XGBoost models.

If this is right

One latent phenotype showed 77.5 percent meeting the multimorbidity definition of at least two conditions among hypertension, hypercholesterolemia, cardiovascular disease, kidney disease, and kidney stones.
XGBoost achieved higher discrimination (ROC-AUC 0.766) than logistic regression (0.667) but worse probability accuracy and calibration (Brier score 0.069 versus 0.059; expected calibration error 0.113 versus 0.037).
Dominant drivers of the phenotypes were age, PHQ-9 depression score, income-to-poverty ratio, race/ethnicity, education level, and the adverse reproductive index.
Adverse reproductive burden affected 58 percent of the sample and was strongly represented in the fragile cluster.
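
The gap between discrimination and calibration noted above can be made concrete with a small sketch. The functions below are plain-Python versions of ROC-AUC, the Brier score, and a fixed-width-bin expected calibration error; the paper's exact binning scheme is not stated in the abstract, so ten equal-width bins is an assumption. The `sharp` and `flat` score vectors are hypothetical toy values, not the paper's data.

```python
def roc_auc(probs, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average predicted risk and observed rate per bin.

    Assumption: equal-width bins on [0, 1]; other binning schemes exist.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            ece += len(members) / len(labels) * abs(mean_p - rate)
    return ece


# Hypothetical toy data: 2 events among 8 people.
labels = [0] * 6 + [1] * 2
sharp = [0.6] * 6 + [0.9] * 2   # ranks perfectly, but overstates everyone's risk
flat = [0.25] * 8               # predicts the base rate for everyone

for name, probs in [("sharp", sharp), ("flat", flat)]:
    print(name, roc_auc(probs, labels), brier_score(probs, labels),
          expected_calibration_error(probs, labels))
```

In this toy case the `sharp` scores rank every event above every non-event yet have the worse Brier score and calibration error, which is the same qualitative pattern the paper reports for XGBoost versus logistic regression.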

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reproductive history variables could be added to routine young-adult health checks as an inexpensive early warning for multimorbidity risk.
The same phenotyping pipeline might be tested on longitudinal cohorts to assess whether the clusters predict future disease incidence beyond cross-sectional association.
Calibration issues in tree-based models suggest that hybrid or post-hoc recalibration steps would be needed before any phenotype-based triage enters clinical guidelines.
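
One simple form of post-hoc recalibration is histogram binning, sketched here under stated assumptions. The paper does not describe any recalibration step, so this is an illustration of the general idea rather than the author's method; the function names, the five-bin choice, and the toy scores are all hypothetical.

```python
def fit_histogram_binning(scores, labels, n_bins=5):
    """Learn one observed event rate per equal-width score bin on a calibration set."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to their midpoint so every score still maps somewhere.
    return [sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def apply_histogram_binning(scores, bin_rates):
    """Replace each raw score with the event rate learned for its bin."""
    n_bins = len(bin_rates)
    return [bin_rates[min(int(s * n_bins), n_bins - 1)] for s in scores]


# Hypothetical calibration set with overconfident raw scores.
cal_scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
cal_labels = [1, 1, 0, 0, 0, 0, 0, 1]

bin_rates = fit_histogram_binning(cal_scores, cal_labels)
recalibrated = apply_histogram_binning(cal_scores, bin_rates)
print(recalibrated)
```

The mapping should be fitted on data held out from model training and then evaluated on a further independent set; applying it back to the calibration set, as done here for brevity, gives an optimistic picture.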

Load-bearing premise

The chosen reproductive and chronic-condition variables, after PCA reduction and k-means clustering with k=4, produce clinically meaningful phenotypes instead of artifacts of variable coding or the specific multimorbidity definition.

What would settle it

An independent dataset or alternative clustering method in which the high-burden reproductive phenotype shows multimorbidity rates no higher than the other groups would falsify the claimed clustering.
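
A minimal sketch of how such a replication check might be scored, assuming a simple one-sided two-proportion z-test on unweighted counts. The counts below are hypothetical, and a real NHANES-style analysis would need to account for survey weights and design, which this sketch ignores.

```python
import math


def one_sided_two_proportion_z(x1, n1, x2, n2):
    """Test H1: rate in group 1 exceeds rate in group 2 (pooled-variance z-test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # upper-tail normal probability
    return z, p_value


# Hypothetical replication: 12/100 multimorbid in the high-burden phenotype
# versus 90/900 in all other phenotypes combined.
z, p_value = one_sided_two_proportion_z(12, 100, 90, 900)
print(round(z, 3), round(p_value, 3))
```

With these made-up counts the high-burden rate is not detectably higher than the rest, which is the kind of result that would count against the claimed clustering.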

Figures

Figures reproduced from arXiv: 2604.22890 by Sunday A. Adetunji.

**Figure 1.** Latent phenotypes of reproductive-chronic disease-mental-health profiles among U.S. women aged 20–44 years. Participants are projected onto the first two principal components of the standardized predictor matrix, and colors denote the four k-means-derived phenotypes. The purpose of the display is descriptive summarization of heterogeneity rather than causal subtype discovery.

**Figure 2.** Receiver-operating characteristic curves for restricted early multimorbidity classification comparing logistic regression with gradient-boosted trees (XGBoost). The diagonal line denotes chance performance. The key interpretation is not only the separation of curves but the mismatch between discrimination and calibration documented in …

**Figure 3.** Top ten features associated with restricted early multimorbidity in the XGBoost model. Bars represent mean …

Original abstract

Background: Adverse reproductive history is a multisystemic risk factor, but evidence is constrained by isolated outcome studies, limited adjustment, and non-interpretable algorithmic models. We re-frame the estimand from prediction to concurrent risk classification and emphasize calibration, interpretability, and systematic error.

Methods: We analyzed 1,602 U.S. women aged 20-44 years from NHANES 2017-March 2020 with reproductive-history variables, chronic-condition indicators, and PHQ-9 data. Restricted multimorbidity was defined as at least two of hypertension, hypercholesterolemia, cardiovascular disease, kidney disease, and kidney stones. Features were summarized using principal components analysis and k-means clustering. We compared multivariable logistic regression with XGBoost and used SHAP values to quantify contributions.

Results: Early multimorbidity occurred in 6.6% (106/1,602); 71.0% had no chronic condition and 22.4% had one. Adverse reproductive burden was common: 58% had at least one adverse reproductive factor and 12.6% had three or more. Four latent phenotypes emerged (n=398, 508, 102, 594), including a fragile subgroup in which 77.5% met the multimorbidity definition. In holdout evaluation, XGBoost improved discrimination relative to logistic regression (ROC-AUC 0.766 vs 0.667), but showed worse probability accuracy and calibration (Brier 0.069 vs 0.059; expected calibration error 0.113 vs 0.037). Dominant drivers were age, PHQ-9 score, income-to-poverty ratio, race/ethnicity, education, and the adverse reproductive index.

Conclusions: Principal components analysis and k-means phenotyping revealed that adverse reproductive life-course structure is strongly clustered with concurrent early multimorbidity in U.S. women aged 20-44 years. Although XGBoost improved discrimination, calibration and feature attribution remained essential for reliable translation into practice.
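
The counts in the abstract can be cross-checked with a few lines of arithmetic. This sketch assumes the reported percentages are unweighted sample proportions and that the fragile subgroup is the cluster of 102 women (as stated in the referee report on this page); the implied case counts are derived here, not reported by the author.

```python
# Figures taken from the abstract.
n_total = 1602
n_multimorbid = 106
cluster_sizes = [398, 508, 102, 594]
fragile_n = 102          # fragile subgroup size, per the referee summary
fragile_rate = 0.775     # reported multimorbidity prevalence in that subgroup

# The four phenotypes should partition the analytic sample.
clusters_cover_sample = sum(cluster_sizes) == n_total

# Overall prevalence implied by 106/1,602.
overall_prevalence = n_multimorbid / n_total

# Implied (derived, not reported) number of multimorbid women in the fragile cluster.
fragile_cases = round(fragile_rate * fragile_n)

# Share of all multimorbid women who sit in the fragile cluster, and the
# multimorbidity rate left over in the other three clusters combined.
share_of_cases_in_fragile = fragile_cases / n_multimorbid
remaining_rate = (n_multimorbid - fragile_cases) / (n_total - fragile_n)

print(clusters_cover_sample, round(overall_prevalence * 100, 1), fragile_cases,
      round(share_of_cases_in_fragile * 100, 1), round(remaining_rate * 100, 1))
```

Under these assumptions roughly three quarters of all multimorbid women fall in the single fragile cluster, leaving a rate under 2 percent in the remaining three, which is worth keeping in mind when reading the objections about outcome indicators entering the clustering.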

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The clustering includes the outcome indicators, weakening the claim that reproductive phenotypes independently cluster with multimorbidity, while the calibration comparison holds up better.

The main takeaway is that the reported reproductive phenotypes linked to early multimorbidity in this NHANES sample come from a clustering step that already includes the multimorbidity-defining conditions, so the association is not as independently discovered as the abstract suggests. The more reliable contribution is the head-to-head comparison of XGBoost and logistic regression with explicit calibration checks.

The paper applies PCA and k-means to a mix of reproductive history, chronic conditions, and depression scores in 1,602 women aged 20-44. It identifies four groups, including one small group with 77.5% multimorbidity prevalence, and shows that XGBoost reaches an AUC of 0.766 compared to 0.667 for logistic regression, though with a higher (worse) Brier score and calibration error. SHAP values flag age, PHQ-9, income-to-poverty ratio, race, education, and the adverse reproductive index as important.

This is solid in the sense that it uses real survey data, reports concrete numbers, and pays attention to calibration and interpretability rather than just accuracy. Those are good practices for health applications.

The soft spots are around the clustering. Because the chronic indicators are in the feature set, separating a high-multimorbidity group is not unexpected. Without PC loadings, variable contributions within clusters, or an ablation that drops the condition variables before clustering, it is hard to say how much of the heavy lifting the reproductive variables are doing. The abstract also does not report sensitivity to the number of clusters or the exact multimorbidity threshold. Missing data handling and full cross-validation details are not spelled out either. The math is standard, and I see no errors in the reported metrics from the abstract.

This is for epidemiologists and public health researchers who analyze NHANES or similar surveys for multimorbidity patterns in reproductive-age women. A reader who wants to see standard ML tools applied to this topic with some emphasis on calibration would get practical value from it. It deserves peer review because the data is accessible and the calibration focus is worthwhile, though revisions would be needed to strengthen the phenotype claims. I would recommend sending it for review with requests to include the ablation study, loadings, and more method details.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes NHANES 2017-March 2020 data for 1,602 U.S. women aged 20-44 years. It applies principal components analysis and k-means clustering to reproductive-history variables, chronic-condition indicators (hypertension, hypercholesterolemia, CVD, kidney disease, kidney stones), and PHQ-9 scores to derive four latent phenotypes. Multimorbidity is defined as ≥2 of the chronic conditions. A 'fragile' phenotype (n=102) shows 77.5% multimorbidity prevalence. XGBoost is compared to logistic regression for prediction, with SHAP for interpretability, reporting AUC 0.766 vs 0.667 but poorer calibration.

Significance. If the phenotypes are not artifacts of the feature selection, the work provides evidence linking adverse reproductive life-course factors to concurrent early multimorbidity, with potential for improved risk stratification in clinical practice. The emphasis on calibration and explainability is a strength, but the clustering approach requires validation to confirm the association is driven by reproductive variables rather than the outcome indicators themselves.

major comments (2)

[Methods] Methods section: The feature set for PCA and k-means clustering includes the same chronic-condition indicators used to define multimorbidity (≥2 conditions). This setup makes the separation of a high-multimorbidity 'fragile' cluster (77.5% prevalence) expected whenever these indicators have variance, potentially rendering the reported association with reproductive structure partly definitional. No ablation removing chronic indicators or reporting of PC loadings and variable contributions within clusters is provided to demonstrate that reproductive variables drive the phenotypes.
[Results] Results/Abstract: The manuscript reports concrete metrics (AUC 0.766 vs 0.667, Brier scores, calibration error) and phenotype prevalences, but lacks details on cross-validation strategy, missing-data handling, exact feature list, sensitivity to choice of k=4, or how the multimorbidity threshold affects clustering. These omissions limit assessment of robustness for the central phenotyping claim.
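
The first major comment can be illustrated with a toy ablation. The sketch below runs a plain Lloyd's k-means with hand-picked starting centroids on eight hypothetical rows in which the reproductive index is, by construction, unrelated to the chronic-condition count. It skips standardization and PCA, uses k=2 rather than the paper's k=4, and is not the paper's pipeline; every name and value is an assumption made for illustration.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def prevalence_by_cluster(labels, outcome, k):
    """Outcome rate within each cluster."""
    rates = []
    for j in range(k):
        members = [y for y, lab in zip(outcome, labels) if lab == j]
        rates.append(sum(members) / len(members) if members else float("nan"))
    return rates


# Hypothetical toy rows: (reproductive_index, chronic_condition_count).
rows = [(0, 0)] * 3 + [(1, 0)] * 3 + [(0, 2), (1, 2)]
multimorbid = [int(count >= 2) for _, count in rows]

# Full feature set: the condition count that defines the outcome is a clustering input.
full_labels = kmeans(rows, [(0, 0), (0, 2)])
full_prev = prevalence_by_cluster(full_labels, multimorbid, 2)

# Ablation: cluster on the reproductive index alone.
ablated_labels = kmeans([(r,) for r, _ in rows], [(0,), (1,)])
ablated_prev = prevalence_by_cluster(ablated_labels, multimorbid, 2)

print(full_prev, ablated_prev)
```

In this constructed example the high-prevalence cluster appears only when the outcome-defining variable is a clustering input and vanishes once it is removed. Whether the same happens in the NHANES data is exactly what the requested ablation would show.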

minor comments (2)

[Abstract] Abstract conclusion: The phrasing that PCA and k-means 'revealed' the clustering overstates the discovery without supporting diagnostics such as loadings or ablation, given the feature overlap with the outcome definition.
Consider adding a table of PC loadings or cluster centroids to allow readers to evaluate variable importance and assess whether reproductive factors, rather than chronic indicators, dominate the latent structure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight key areas for improving the transparency and robustness of our phenotyping and predictive analyses. We address each major comment point by point below, indicating revisions where appropriate.

Referee: [Methods] Methods section: The feature set for PCA and k-means clustering includes the same chronic-condition indicators used to define multimorbidity (≥2 conditions). This setup makes the separation of a high-multimorbidity 'fragile' cluster (77.5% prevalence) expected whenever these indicators have variance, potentially rendering the reported association with reproductive structure partly definitional. No ablation removing chronic indicators or reporting of PC loadings and variable contributions within clusters is provided to demonstrate that reproductive variables drive the phenotypes.

Authors: We acknowledge that this concern is valid: because the chronic-condition indicators are included in the feature set and also define the multimorbidity outcome, the high prevalence in the 'fragile' cluster is partly by construction. Our aim was to identify integrated latent structures linking reproductive history, chronic conditions, and depressive symptoms rather than to isolate reproductive effects. To address the referee's point directly, we will add an ablation analysis repeating PCA and k-means using only reproductive-history variables and PHQ-9 scores (excluding chronic indicators), report the resulting cluster-multimorbidity associations, and include principal component loadings plus per-variable contributions to cluster membership. These additions will appear in the revised Methods and Results sections. revision: yes
Referee: [Results] Results/Abstract: The manuscript reports concrete metrics (AUC 0.766 vs 0.667, Brier scores, calibration error) and phenotype prevalences, but lacks details on cross-validation strategy, missing-data handling, exact feature list, sensitivity to choice of k=4, or how the multimorbidity threshold affects clustering. These omissions limit assessment of robustness for the central phenotyping claim.

Authors: We agree that these methodological details are necessary for reproducibility and to evaluate robustness. In the revision we will expand the Methods section to specify the exact feature list, missing-data handling procedures, cross-validation strategy for the XGBoost and logistic regression models, sensitivity analyses for the choice of k (including silhouette scores and comparisons for k=3 to 6), and the effect of alternative multimorbidity thresholds on cluster stability. These additions will allow readers to assess the central phenotyping results more rigorously. revision: yes
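
For readers unfamiliar with the silhouette score mentioned in the second response, here is a minimal plain-Python sketch of the mean silhouette for a given labeling. It assumes Euclidean distance and at least two clusters, and scores singleton clusters as 0 by convention; the four one-dimensional points are hypothetical and only show how the score separates a sensible partition from a poor one.

```python
import math


def silhouette_mean(points, labels):
    """Mean silhouette width over all points (requires at least two clusters)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups on a line.
points = [(0,), (1,), (10,), (11,)]
good = silhouette_mean(points, [0, 0, 1, 1])   # matches the visible grouping
bad = silhouette_mean(points, [0, 1, 0, 1])    # splits each group across clusters
print(round(good, 4), round(bad, 4))
```

A sensitivity analysis of the kind the authors promise would compute this score for each candidate k (for example 3 to 6) on the actual feature matrix and report how the fragile cluster changes across those solutions.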

Circularity Check

0 steps flagged

No significant circularity in the phenotyping or ML analysis

The paper conducts unsupervised PCA and k-means on NHANES features that include both reproductive-history variables and chronic-condition indicators, then reports cluster compositions with respect to a post-hoc multimorbidity definition (>=2 chronic conditions). This is a descriptive grouping exercise on external public data using standard methods; the observed co-clustering of reproductive burden with multimorbidity is an empirical pattern in the data rather than a quantity forced by construction or by any fitted parameter. No equations, self-citations, or uniqueness theorems are invoked to derive the central claim. The analysis remains self-contained against the external benchmark data without tautological reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard epidemiological assumptions about survey data and data-driven clustering choices rather than new mathematical axioms or unproven physical entities.

free parameters (2)

Number of clusters k
Selected to identify latent phenotypes after PCA; value 4 reported in results
Multimorbidity definition threshold
Set as at least two of five listed conditions; directly determines the 6.6% prevalence and phenotype rates

axioms (2)

Domain assumption: the NHANES 2017-March 2020 sample is representative of U.S. women aged 20-44 for the variables analyzed
Invoked to support generalization of phenotype and multimorbidity findings
Domain assumption: adverse reproductive factors can be meaningfully summarized into a single index for clustering and attribution
Used as a dominant driver in SHAP and phenotype interpretation

invented entities (1)

Fragile subgroup phenotype (no independent evidence)
purpose: To label the high-multimorbidity cluster emerging from k-means
Derived entirely from the clustering procedure on this dataset; no external validation or independent measurement provided

pith-pipeline@v0.9.0 · 5686 in / 1680 out tokens · 52537 ms · 2026-05-08T08:38:05.309971+00:00 · methodology
