
arxiv: 2604.17611 · v1 · submitted 2026-04-19 · 💻 cs.LG · cs.AI

STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments

Pith reviewed 2026-05-10 06:16 UTC · model grok-4.3

keywords Parkinson's disease · severity classification · machine learning · multimodal assessments · explainable AI · gradient boosting · disease staging · progression monitoring

The pith

Machine learning models using multimodal clinical assessments can classify Parkinson's disease severity into three stages with over 94 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a framework to classify the severity of Parkinson's disease by grouping standard clinical staging into healthy, mild, and moderate-to-severe categories. It applies machine learning to integrate subjective and objective measures from multiple assessments over time. The strongest results come from a gradient boosting model that reaches high accuracy in both pairwise and three-way classification tasks. Explanations from the model indicate that features related to balance and axial function become more important as the disease advances. A reader would care because such tools could improve monitoring of symptom progression without needing specialized equipment.

Core claim

The paper establishes that multimodal clinical assessments enable accurate visit-level stratification of Parkinson's disease severity. A gradient boosting classifier delivers 95.48 percent accuracy in distinguishing healthy from mild cases, 99.44 percent for healthy versus moderate-to-severe, and 96.78 percent for mild versus moderate-to-severe. On the three-class problem it delivers 94.14 percent accuracy with a macro F1 of 0.8775. The accompanying explanations show a shift from motor features to axial and balance impairments as severity increases.
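The gap between 94.14 percent accuracy and 0.8775 macro F1 is easier to read with the two definitions side by side: accuracy weights every visit equally, while macro F1 averages per-class F1 scores with equal weight per class, so a small, harder class lowers it more. The sketch below computes both from a confusion matrix. The matrix is a hypothetical toy example chosen only to show the effect; it is not taken from the paper.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1; rows are true classes, columns predicted."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted elsewhere
        fp = sum(cm[r][c] for r in range(k)) - tp   # predicted c, true class elsewhere
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


# Hypothetical counts (NOT from the paper): two large classes and one small class.
toy_cm = [
    [480, 20, 0],
    [15, 470, 15],
    [0, 30, 20],
]

print(f"accuracy = {accuracy(toy_cm):.4f}, macro F1 = {macro_f1(toy_cm):.4f}")
```

In this toy case most errors fall on the smallest class, so macro F1 lands well below accuracy even though overall accuracy stays high.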

What carries the argument

The STEP-PD severity-aware framework carries the argument: it combines multimodal clinical data with a gradient boosting classifier and SHAP value explanations, yielding both predictions and insight into feature importance across severity levels.
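Two properties of SHAP values underlie the global and local explanations mentioned here. For a single sample, the base value plus the per-feature contributions sums to the model output (the idea behind a waterfall plot), and a common global ranking is the mean absolute contribution per feature. The sketch below illustrates both with plain Python. The feature names and numbers are hypothetical placeholders, not values from the paper, and for a tree model such as XGBoost the output being explained is typically a raw margin rather than a probability.

```python
def reconstruct_output(base_value, shap_row):
    """Additivity: model output = base value + sum of per-feature contributions."""
    return base_value + sum(shap_row.values())


def global_importance(shap_rows):
    """Mean absolute SHAP value per feature across all samples."""
    n = len(shap_rows)
    features = shap_rows[0].keys()
    return {f: sum(abs(row[f]) for row in shap_rows) / n for f in features}


# Hypothetical per-sample SHAP values for two illustrative features.
toy_rows = [
    {"motor_item": 0.5, "balance_item": -0.25},
    {"motor_item": -0.25, "balance_item": 0.75},
]
toy_base_value = 0.25  # hypothetical expected model output

print(reconstruct_output(toy_base_value, toy_rows[0]))
print(global_importance(toy_rows))
```

Computing `global_importance` separately on the samples of each cohort would give a cohort-wise decomposition in the spirit of the stacked bars in Figure 5, though the paper's exact procedure is not described on this page.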

If this is right

  • Visit-level predictions become feasible using routinely collected clinical data.
  • Explanations highlight changing importance of motor versus balance features as severity increases.
  • High performance holds across multiple binary distinctions and the full three-class task.
  • Interpretability supports clinical trust and understanding of progression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multimodal approaches could be tested for staging other neurodegenerative conditions.
  • Prioritizing assessments of axial function might improve efficiency in clinical evaluations.
  • Independent validation on single-visit data per patient would confirm robustness against repeated measures bias.

Load-bearing premise

The boundaries of the clinical staging system used to define the three severity classes are stable and consistent, and repeated assessments from the same individuals do not introduce data leakage or bias during model evaluation.

What would settle it

Testing the model on a completely independent set of patients, each assessed only once, would settle it: accuracy falling substantially below the reported levels would falsify the claim.
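One way to approximate such a single-visit test set from longitudinal records is to keep exactly one visit per subject before evaluation. The sketch below keeps the earliest visit for each subject. The record layout and the field names `subject_id` and `visit_month` are assumptions made for illustration and do not reflect the actual PPMI schema.

```python
def first_visit_per_subject(visits):
    """Return one record per subject: the visit with the smallest visit_month."""
    earliest = {}
    for visit in visits:
        sid = visit["subject_id"]
        if sid not in earliest or visit["visit_month"] < earliest[sid]["visit_month"]:
            earliest[sid] = visit
    return list(earliest.values())


# Hypothetical records; identifiers and months are placeholders.
toy_visits = [
    {"subject_id": "S1", "visit_month": 12},
    {"subject_id": "S1", "visit_month": 0},
    {"subject_id": "S2", "visit_month": 6},
    {"subject_id": "S2", "visit_month": 24},
]

single_visit_set = first_visit_per_subject(toy_visits)
print(single_visit_set)
```

The subjects in this set would also need to be absent from the training data for the test to count as independent.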

Figures

Figures reproduced from arXiv: 2604.17611 by Ananda Mohan Mondal, Christian Poellabauer, John Michael Templeton, Md Mezbahul Islam.

Figure 1: Overall Study Framework including Data Description, Data Preprocessing, Classification Algorithms, and Explanation Technique and Precision Clinical Decision. EPW: Epworth Sleepiness Scale; MDS-UPDRS: Movement Disorder Society Unified Parkinson’s Disease Rating Scale; QUIP: Questionnaire for Impulsive-Compulsive Disorders; REM: Rapid Eye Movement; SCOPA-AUT: Scales for Outcomes in Parkinson’s disease - Auto…
Figure 2: Cohort description after data preprocessing. The mild cohort includes samples from stages 1 and 2, and the Moderate to Severe (Mod-Severe) cohort includes samples from stages 3, 4, and 5.
Figure 3: t-SNE visualization of the learned feature space for PD stage classification across different diagnostic settings. Panels show two-dimensional t-SNE projections of subject representations for (a) Healthy vs. Mild, (b) Healthy vs. Mod-Severe, (c) Mild vs. Mod-Severe, and (d) multi-class classification of Healthy, Mild, and Mod-Severe.
Figure 4: Out-of-Fold (OOF) confusion matrix (normalized) for the best ML model. XGBoost achieved the highest accuracy with all four classification problems. Healthy: Healthy samples; Mild: Stage 1 and Stage 2 PD samples; Mod-Severe (Moderate to severe): Stage 3, Stage 4, and Stage 5 PD samples.
Figure 5: Cohort-wise global feature contributions for three Parkinson’s disease classification tasks. Stacked bar plots show the mean absolute SHAP values of the 15 top features, decomposed by cohort contribution, for (a) Healthy vs. Mild, (b) Healthy vs. Mod-Severe, and (c) Mild vs. Mod-Severe. Green bars indicate feature contributions to the less severe cohort, while red bars represent feature contributions to th…
Figure 6: Sample-specific feature contributions for a Healthy (patient No: 3000) and Mild (patient No: 3001) sample.
Original abstract

Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.
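The three-way grouping described in the abstract can be written as a small labeling function. The sketch below is illustrative only: the function name, the `is_healthy_control` flag, and the choice to return `None` for stage 0 or missing stages in non-control participants are assumptions, since the abstract does not say how such visits were handled.

```python
from typing import Optional


def severity_class(hy_stage: Optional[int], is_healthy_control: bool = False) -> Optional[str]:
    """Map a Hoehn and Yahr stage to Healthy, Mild (1-2), or Mod-Severe (3-5)."""
    if is_healthy_control:
        return "Healthy"
    if hy_stage in (1, 2):
        return "Mild"
    if hy_stage in (3, 4, 5):
        return "Mod-Severe"
    return None  # assumption: unlabeled visits are excluded rather than guessed


# The four classification problems named in the abstract.
TASKS = [
    ("Healthy", "Mild"),
    ("Healthy", "Mod-Severe"),
    ("Mild", "Mod-Severe"),
    ("Healthy", "Mild", "Mod-Severe"),
]

print([severity_class(stage) for stage in range(1, 6)])
```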

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes STEP-PD, a machine learning framework for stage-aware classification of Parkinson's disease severity into Healthy, Mild (Hoehn and Yahr 1-2), and Moderate-to-Severe (Hoehn and Yahr 3-5) categories. It integrates multimodal subjective questionnaires and objective clinician assessments from all PPMI visits, evaluates three binary tasks plus a three-class task via stratified cross-validation and imbalance-aware training with XGBoost, reports high accuracies (95.48%, 99.44%, 96.78% binary; 94.14% accuracy and 0.8775 Macro-F1 for three-class), and applies SHAP for global and local explanations that highlight a shift toward axial and balance features with progression.

Significance. If the performance holds under subject-blocked evaluation, the work offers a practical contribution by demonstrating that routinely collected multimodal clinical data can support accurate, interpretable visit-level severity stratification. Strengths include the use of clinically meaningful Hoehn and Yahr boundaries, concrete cross-validation metrics, imbalance-aware training, and SHAP-based explanations that link feature importance to disease progression stages.
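The page does not say which form of imbalance-aware training the paper used. As one common possibility, the sketch below computes inverse-frequency class weights, w_c = N / (K * n_c), which is the heuristic behind scikit-learn's `class_weight="balanced"` option. Treat this as an assumption about what "imbalance-aware" could mean, not as the authors' procedure; the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label distribution (NOT the PPMI class counts).
toy_labels = ["Healthy"] * 6 + ["Mild"] * 3 + ["Mod-Severe"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, so the rarest class is not drowned out by the most frequent one.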

major comments (1)
  1. [Methods] Methods section (cross-validation description): The paper states that stratified cross-validation was used on all available visits but does not specify whether splits were performed at the visit level or blocked by subject ID (e.g., GroupKFold or Leave-One-Subject-Out). Because Hoehn and Yahr stages change slowly and multimodal features are correlated across visits from the same patient, visit-level splitting risks data leakage that could inflate the reported accuracies (95.48% Healthy vs. Mild, 99.44% Healthy vs. Moderate-to-Severe, 96.78% Mild vs. Moderate-to-Severe, and 94.14% three-class). Please clarify the exact splitting procedure and report results with subject-level blocking if they differ.
minor comments (3)
  1. [Abstract] Abstract and Methods: Add the total number of unique subjects and visits from the PPMI cohort to contextualize the dataset scale and the impact of repeated measures.
  2. [Methods] Methods: Provide explicit details on missing-data handling, feature preprocessing steps, and the hyperparameter search procedure for XGBoost to support reproducibility.
  3. [Results] Results: The SHAP waterfall plots are useful for local explanations; ensure all axes and feature names are clearly labeled and that example patients are described without revealing protected health information.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. The major comment regarding the cross-validation procedure is well-taken, and we address it directly below. We have revised the manuscript to clarify the original procedure and incorporate additional subject-blocked analyses.

point-by-point responses
  1. Referee: [Methods] Methods section (cross-validation description): The paper states that stratified cross-validation was used on all available visits but does not specify whether splits were performed at the visit level or blocked by subject ID (e.g., GroupKFold or Leave-One-Subject-Out). Because Hoehn and Yahr stages change slowly and multimodal features are correlated across visits from the same patient, visit-level splitting risks data leakage that could inflate the reported accuracies (95.48% Healthy vs. Mild, 99.44% Healthy vs. Moderate-to-Severe, 96.78% Mild vs. Moderate-to-Severe, and 94.14% three-class). Please clarify the exact splitting procedure and report results with subject-level blocking if they differ.

    Authors: We thank the referee for highlighting this critical methodological detail. The original analysis employed stratified k-fold cross-validation at the visit level using StratifiedKFold, without subject blocking; this was not explicitly described in the Methods section. We agree that the longitudinal structure of PPMI data (slowly changing Hoehn and Yahr stages and correlated features within subjects) creates a risk of leakage under visit-level splits. In the revised manuscript we have updated the Methods section to state the splitting procedure precisely. We have also conducted new experiments with subject-level blocking via GroupKFold (ensuring all visits from a given subject remain in the same fold) and report these results alongside the original metrics in an updated table and discussion section. The subject-blocked performance remains competitive, though modestly lower, and we discuss the implications for clinical applicability. These changes strengthen the validity of the reported findings.

    revision: yes
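The difference between visit-level and subject-blocked splitting can be shown without any machine learning library. The sketch below assigns visits to folds in two ways and lists the subjects whose visits end up in more than one fold, which is the leakage the referee describes. It is a simplified stand-in and not scikit-learn's implementation: the round-robin and greedy assignments are assumptions made for illustration, class stratification is ignored, and the subject IDs are hypothetical. In practice scikit-learn's `GroupKFold`, or `StratifiedGroupKFold` when class balance per fold also matters, would be the usual tools.

```python
from collections import defaultdict


def visit_level_folds(n_visits, k):
    """Round-robin fold assignment that ignores which subject a visit belongs to."""
    return [i % k for i in range(n_visits)]


def subject_blocked_folds(subject_ids, k):
    """Put all visits of a subject in one fold, filling the smallest fold first."""
    visits_per_subject = defaultdict(int)
    for sid in subject_ids:
        visits_per_subject[sid] += 1
    fold_sizes = [0] * k
    fold_of_subject = {}
    # Place subjects with the most visits first to keep fold sizes balanced.
    for sid, n in sorted(visits_per_subject.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_subject[sid] = fold
        fold_sizes[fold] += n
    return [fold_of_subject[sid] for sid in subject_ids]


def leaking_subjects(subject_ids, folds):
    """Subjects whose visits appear in more than one fold."""
    folds_seen = defaultdict(set)
    for sid, fold in zip(subject_ids, folds):
        folds_seen[sid].add(fold)
    return {sid for sid, seen in folds_seen.items() if len(seen) > 1}


# Hypothetical subject IDs, one entry per visit.
toy_subjects = ["P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3", "P4", "P5", "P5", "P6"]

print(sorted(leaking_subjects(toy_subjects, visit_level_folds(len(toy_subjects), 3))))
print(sorted(leaking_subjects(toy_subjects, subject_blocked_folds(toy_subjects, 3))))
```

Under the visit-level assignment several subjects straddle folds, so a model can be tested on visits from patients it has already seen; under the blocked assignment no subject does.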

Circularity Check

0 steps flagged

No circularity: empirical ML performance metrics obtained via cross-validation on held-out data

full rationale

The paper reports visit-level classification accuracies from XGBoost (and other models) trained on multimodal PPMI features with Hoehn-Yahr-derived labels. These metrics arise from stratified cross-validation on held-out folds rather than any derivation, ansatz, or parameter fit that reduces the reported numbers to the input definitions by construction. No equations, uniqueness theorems, or self-citations are invoked to force the central claims; the results remain externally falsifiable and independent of the label-generation process.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim rests on the clinical validity of Hoehn and Yahr staging to define severity classes and on standard supervised learning assumptions that cross-validation estimates generalization. No new entities are postulated.

free parameters (1)
  • XGBoost hyperparameters
    Tuned values that control tree depth, learning rate, and regularization; these are fitted to maximize the reported accuracies.
axioms (2)
  • domain assumption: Hoehn and Yahr stages provide stable, clinically meaningful boundaries for grouping into Healthy, Mild, and Moderate-to-Severe
    Directly used to label the target classes in all tasks.
  • domain assumption: Multimodal clinical assessments from PPMI visits are sufficient and unbiased inputs for severity prediction
    Assumed when feeding questionnaires and clinician measures into the model.


