arxiv: 2605.13464 · v1 · pith:OIT4KSOSnew · submitted 2026-05-13 · 💻 cs.LG

A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing

Pith reviewed 2026-05-14 19:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords: diabetes detection · subtype clustering · cognitive association · machine learning classification · K-Means clustering · SHAP explainability · glycaemic control · metabolic-cognitive link

The pith

A three-stage machine learning framework detects diabetes, clusters subtypes without labels, and links better glycaemic control to higher cognitive scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a single reproducible pipeline that first classifies diabetes presence from routine clinical measurements, then partitions confirmed cases into subtypes through clustering on a few key variables, and finally tests whether metabolic control correlates with cognitive performance in longitudinal data. It benchmarks multiple classifiers with cross-validation and feature attribution, applies silhouette-validated K-Means to recover two groups, and reports a statistically corrected positive correlation. The work matters because it shows how standard, interpretable machine-learning steps can be chained to move from diagnosis to exploratory subtype analysis and hypothesis testing without requiring new labeled subtype data.

Core claim

The authors establish that supervised classifiers reach an ROC-AUC of 0.825 and accuracy of 0.762 on the NCSU diabetes dataset with Glucose, BMI, and Age as dominant predictors, that K-Means clustering using Glucose, Insulin, and Age yields two partitions among diabetic cases with silhouette score approximately 0.116 interpreted as clinically plausible, and that glycaemic control shows a significant positive Spearman correlation of 0.208 with cognitive function in the Ohio dataset that survives Holm correction.
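
As a reading aid for the ROC-AUC figures quoted above, the following is a minimal sketch of how ROC-AUC can be computed directly from labels and scores as a pairwise ranking statistic. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are assumptions made for illustration, and a real pipeline would more likely call a library routine.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration (not taken from the NCSU dataset).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

The quadratic pairwise loop is chosen for readability; sorting-based formulations give the same value faster on large test sets.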

What carries the argument

The unified three-stage pipeline: supervised classification with SHAP explainability for detection, silhouette-validated K-Means clustering for subtype discrimination, and statistical correlation testing for metabolic-cognitive associations.

If this is right

  • Glucose, BMI, and Age function as the primary predictive biomarkers in the classification stage.
  • Two subtype partitions can be recovered from diabetic cases using only Glucose, Insulin, and Age without ground-truth labels.
  • Glycaemic control maintains a positive association with cognitive function after multiple-testing correction.
  • The combination of cross-validation, feature attribution, and statistical validation supports reproducible diabetes analytics.
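
The multiple-testing correction mentioned in the list above can be sketched as follows. This is a minimal, assumption-laden illustration of the Holm step-down adjustment: the function name `holm_adjust` is hypothetical, and only the first p-value (5.29e-5) comes from the paper. The page does not say how many hypotheses were tested, so the other two p-values are invented purely to show the mechanics.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# First value is the reported Stage 3 p-value; the other two are hypothetical.
raw = [5.29e-5, 0.03, 0.20]
adj = holm_adjust(raw)
survives = [p < 0.05 for p in adj]
print(adj, survives)
```

Under these assumed inputs the reported p-value remains far below 0.05 after adjustment, which is the sense in which a result "survives" Holm correction.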

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clustering approach could be tested on other chronic conditions to identify subtypes from routine lab values alone.
  • Linking the recovered clusters to longitudinal complication data would test whether they carry predictive value for personalized management.
  • Adding additional metabolic or imaging features might raise the silhouette score and clarify whether the observed cognitive association strengthens or attenuates.

Load-bearing premise

A low silhouette score of approximately 0.116 still marks clinically plausible subtype partitions, and the NCSU and Ohio datasets are representative without major unstated selection biases.
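
To make the silhouette premise concrete, here is a minimal sketch of the mean silhouette coefficient under Euclidean distance. It is an illustration, not the paper's implementation: the function name `silhouette_score` and the toy points are assumptions, and the example uses two artificial, well-separated groups rather than Glucose, Insulin, and Age values.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data, invented for illustration: two tight, distant groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(pts, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(pts, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own. A score of about 0.116 lies close to the overlapping end of that range.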

What would settle it

A replication dataset in which the same three features produce K-Means clusters that show no difference in independent clinical outcomes such as complication rates or treatment response would falsify the subtype stage.

Figures

Figures reproduced from arXiv: 2605.13464 by Rishav Tewari, Ruzina Haque Laskar, Vishal Pandey.

Figure 1: Three-stage unified pipeline. Stage 1 performs binary diabetes detection with cross-validated supervised classifiers and SHAP explainability. Stage 2 applies validated K-Means clustering to the diabetic sub-cohort for T1DM/T2DM discrimination. Stage 3 conducts statistical hypothesis testing on the Ohio longitudinal cohort to probe the T3DM glycaemic-cognitive link.

Figure 2: Stage 1 test-set evaluation. Left: confusion matrix for SVM-RBF on the held-out test set. Right: ROC curve with AUC = 0.80.

Figure 3: SHAP beeswarm plot, Random Forest (Stage 1). Each dot represents one test-set instance. Colour encodes feature value (red = high, blue = low). Horizontal position encodes SHAP value (positive = pushes prediction towards diabetic class). Features are ranked by mean |SHAP| in descending order.

Figure 4: K-Means silhouette validation curve. k=2 achieves a silhouette score of ≈ 0.116, consistent with moderate cluster structure. k=4 is a local maximum but lacks clinical interpretability.

Original abstract

Diabetes mellitus affects over 537 million adults worldwide and remains a major challenge in preventive healthcare. Existing machine-learning studies primarily formulate diabetes prediction as a binary classification problem, while subtype-oriented analysis and glycaemic-cognitive associations remain comparatively underexplored. We present a reproducible three-stage machine learning framework for diabetes detection, subtype-oriented clustering, and metabolic-cognitive association analysis. In Stage 1, five supervised classifiers together with a stacking ensemble are benchmarked on the NCSU Diabetes Dataset using stratified five-fold cross-validation and evaluation metrics including ROC-AUC, balanced accuracy, recall, and F1-score. SVM-RBF and Logistic Regression achieve the highest ROC-AUC ($0.825 \pm 0.026$), while Random Forest achieves the highest accuracy ($0.762 \pm 0.030$). SHAP explainability identifies Glucose, BMI, and Age as the dominant predictive biomarkers. In Stage 2, silhouette-validated K-Means clustering ($k=2$, silhouette $\approx 0.116$) is applied to confirmed diabetic cases using Glucose, Insulin, and Age, recovering clinically plausible subtype-oriented partitions without requiring ground-truth subtype labels. In Stage 3, statistical analysis of the Ohio Longitudinal Cognitive Dataset ($n=373$) reveals a significant positive association between glycaemic control and cognitive function ($\rho_s = 0.208$, $p = 5.29 \times 10^{-5}$), which survives Holm correction. The findings support the utility of statistically grounded and interpretable ML pipelines for reproducible diabetes analytics and subtype-aware exploratory analysis.
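
The Stage 3 statistic in the abstract is a Spearman rank correlation. As a hedged illustration of what that quantity measures, the sketch below computes it as the Pearson correlation of average ranks, which handles ties. The function names and the toy vectors are assumptions for illustration; they are not drawn from the Ohio dataset, and the sketch does not compute a p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy vectors, invented for illustration only.
control = [1, 2, 3, 4, 5]
cognition = [2, 4, 5, 4, 5]
rho = spearman_rho(control, cognition)
print(rho)
```

Because it works on ranks, the statistic captures any monotone relationship rather than only a linear one, and a value of 0.208 corresponds to a weak positive monotone trend.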

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a three-stage machine learning framework for diabetes detection using supervised classifiers on the NCSU dataset, subtype discrimination via K-Means clustering on diabetic cases, and analysis of glycaemic-cognitive associations in the Ohio dataset. It reports performance metrics from cross-validation, identifies key biomarkers via SHAP, claims clinically plausible subtypes from clustering with silhouette score ≈0.116, and a significant correlation (ρ_s = 0.208, p = 5.29×10^{-5}) surviving correction.

Significance. If the subtype partitions are clinically meaningful, the work offers a reproducible, interpretable pipeline integrating prediction, exploratory subtyping, and hypothesis testing for diabetes research. Strengths include stratified cross-validation, ensemble benchmarking, SHAP explainability, and proper multiple-testing correction, supporting utility for reproducible diabetes analytics.

major comments (1)
  1. [Stage 2] Stage 2 clustering section: K-Means (k=2 on Glucose/Insulin/Age) yields silhouette ≈0.116, indicating weak separation and overlap. This directly undercuts the claim of recovering 'clinically plausible subtype-oriented partitions' without external validation against known subtypes, clinical thresholds, or additional indices (e.g., Davies-Bouldin).
minor comments (2)
  1. [Stage 1] Stage 1 methods lack explicit details on hyperparameter search ranges, missing-value imputation, and exact feature scaling, limiting full reproducibility despite the reported CV protocol.
  2. [Abstract and Stage 2] The abstract and Stage 2 text describe the clustering as 'silhouette-validated' without acknowledging the low absolute value or discussing its implications for partition quality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the manuscript accordingly.

  1. Referee: [Stage 2] Stage 2 clustering section: K-Means (k=2 on Glucose/Insulin/Age) yields silhouette ≈0.116, indicating weak separation and overlap. This directly undercuts the claim of recovering 'clinically plausible subtype-oriented partitions' without external validation against known subtypes, clinical thresholds, or additional indices (e.g., Davies-Bouldin).

    Authors: We agree that a silhouette score of ≈0.116 reflects weak separation and notable overlap, which limits the strength of interpreting the clusters as clinically definitive subtypes. In the revised manuscript we will moderate the language in the abstract, Stage 2 section, and discussion to describe the results as 'exploratory subtype-oriented partitions' rather than 'clinically plausible'. We will also add the Davies-Bouldin index (and Calinski-Harabasz index) to the cluster validation, explicitly discuss the low silhouette as a limitation, and note the lack of external validation against known clinical subtypes or thresholds. These revisions will improve transparency without altering the reported methodology or metrics.

    Revision: yes
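
For readers unfamiliar with the Davies-Bouldin index named in the exchange above, the following is a minimal sketch of its standard definition with Euclidean distance. It is offered only as an illustration: the function name `davies_bouldin` and the toy points are assumptions, and nothing here reflects values the authors have reported.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower is better), using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("Davies-Bouldin needs at least two clusters")
    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        lab: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for lab, pts in clusters.items()
    }
    # Scatter of each cluster: mean distance from members to the centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Assumes distinct centroids; coincident centroids would divide by zero.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data, invented for illustration: two tight, distant groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
db_good = davies_bouldin(pts, [0, 0, 1, 1])
db_bad = davies_bouldin(pts, [0, 1, 0, 1])
print(db_good, db_bad)
```

Unlike the silhouette score, lower values are better here, so reporting both indices gives two differently constructed views of the same partition.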

Circularity Check

0 steps flagged

No significant circularity in three-stage ML framework

The paper applies standard supervised classifiers (with stratified 5-fold CV and metrics like ROC-AUC) to the NCSU dataset in Stage 1, performs K-Means (k=2) clustering with silhouette validation on diabetic cases using Glucose/Insulin/Age in Stage 2, and computes Spearman correlation on the Ohio dataset in Stage 3. No derivation reduces to its inputs by construction, no fitted parameters are renamed as predictions, and no load-bearing self-citations or ansatzes are present. All steps are direct applications of established methods to external data, yielding independent empirical results.
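
The stratified five-fold protocol referred to above can be illustrated with a short sketch of how fold assignment preserves class proportions. This is an assumption-based illustration rather than the authors' procedure: the function name `stratified_folds`, the seed, and the toy label vector are all hypothetical, and the round-robin dealing is one simple way to stratify among several.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        members = by_class[lab]
        rng.shuffle(members)  # randomise which samples land in which fold
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal indices round-robin across folds
    return folds


# Toy labels, invented for illustration: 20 negatives and 10 positives.
labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, k=5, seed=0)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Each fold serves once as the held-out set while the remaining four are used for training, and every fold keeps the same 2:1 class ratio as the full toy sample.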

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper depends on standard statistical assumptions and the validity of the input datasets; no new physical or mathematical entities are postulated.

free parameters (2)
  • Number of clusters k = 2
    Selected based on silhouette score validation in stage 2
  • Classifier hyperparameters
    Tuned for SVM-RBF, Random Forest etc., but values not reported in abstract
axioms (2)
  • [standard math] Stratified five-fold cross-validation provides unbiased performance estimates
    Used in stage 1 for benchmarking classifiers
  • [domain assumption] The chosen features (Glucose, Insulin, Age) are sufficient for subtype discrimination
    Assumed in stage 2 clustering without ground truth labels

pith-pipeline@v0.9.0 · 5601 in / 1396 out tokens · 56994 ms · 2026-05-14T19:13:37.682852+00:00 · methodology
