Modeling Parkinson's Disease Progression Using Longitudinal Voice Biomarkers: A Comparative Study of Statistical and Neural Mixed-Effects Models

Lanruo Wang; Ran Tong; Tong Wang; Wei Yan

arXiv: 2507.20058 · v4 · submitted 2025-07-26 · stat.ML · cs.LG · stat.AP

Pith reviewed 2026-05-19 03:11 UTC · model grok-4.3

keywords: Parkinson's disease, voice biomarkers, longitudinal data, mixed-effects models, GAMMs, neural mixed effects, prediction error, telemonitoring

The pith

Generalized additive mixed models achieve the lowest prediction error when modeling Parkinson's progression from repeated voice measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares neural and statistical mixed-effects approaches for handling correlated voice data collected over time from the same Parkinson's patients. It shows that flexible neural models tend to overfit badly with only 42 subjects while generalized additive mixed models deliver accurate forecasts along with clear, interpretable effects for each person. A sympathetic reader would care because reliable non-invasive tracking could help monitor disease without needing huge new clinical trials. The work uses one public telemonitoring dataset and evaluates all methods under the same forward-prediction task.

Core claim

On the Oxford Parkinson's telemonitoring dataset of 42 subjects, generalized additive mixed models reach a mean squared error of 6.56 for longitudinal prediction of disease progression from voice biomarkers, while neural mixed-effects models and generalized neural network mixed models produce errors above 90. The statistical approach combines good accuracy with interpretable subject-level random effects and smooth functional terms; the neural baselines offer greater flexibility yet overfit severely in this small-sample regime.

What carries the argument

Head-to-head comparison of Neural Mixed Effects models, Generalized Neural Network Mixed Models, and Generalized Additive Mixed Models under identical longitudinal prediction conditions on voice biomarker trajectories.

If this is right

Statistical mixed-effects models remain practical for small longitudinal telemonitoring studies where data are limited.
Neural mixed-effects models are prone to severe overfitting and high error when sample sizes are modest.
Larger and more heterogeneous patient collections are needed before flexible neural models can be fairly tested in this domain.
Semi-parametric models can keep both predictive strength and interpretable subject-specific structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results point toward matching model complexity directly to available sample size in other small-cohort biomarker studies.
Similar head-to-head tests could be run on different neurodegenerative conditions or non-voice biomarkers to check whether the pattern holds.
Practitioners might consider starting with interpretable mixed models and adding neural components only after confirming they improve performance on held-out data.

Load-bearing premise

All models are evaluated under the same longitudinal prediction task on the single Oxford Parkinson's telemonitoring dataset containing only 42 patients.

What would settle it

A follow-up experiment on an independent cohort of at least 100 patients in which any neural mixed-effects model records an MSE below 20 would directly contradict the reported performance gap.

Figures

Figures reproduced from arXiv: 2507.20058 by Lanruo Wang, Ran Tong, Tong Wang, Wei Yan.

**Figure 2.** Top row: Residuals, fixed effect Q-Q, and random effect Q-Q plots from the original model

**Figure 3.** Estimated spline effect of test_time from GAMM, showing nonlinear progression over time.

Excerpt from Section 3.4, Generalized Neural Network Mixed Model (GNMM) for Non-linear Longitudinal Modeling: Building on Mandel et al. (2023), we use a Generalized Neural Network Mixed Model (GNMM) to predict the longitudinal Total_UPDRS scores collected in the tele-monitoring study of Parkinson's disease. We retain the notation intro…

Original abstract

Longitudinal voice biomarkers provide a non-invasive source of information for monitoring Parkinson's disease progression, but their statistical analysis is difficult because repeated measurements from the same subject are correlated, clinical cohorts are often small, and disease trajectories can vary substantially across individuals. This study evaluates statistical and neural mixed-effects approaches for modeling Parkinson's disease progression from telemonitoring voice data. Using the Oxford Parkinson's telemonitoring dataset (N=42), we compare Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) under the same longitudinal prediction setting. The results show that neural mixed-effects models provide flexible nonlinear representations but can overfit severely in this small-sample setting, whereas GAMMs achieve stronger predictive performance and retain interpretable smooth effects and subject-level structure. In particular, the GAMM-based approach attains the lowest prediction error (MSE 6.56), while the neural baselines have substantially larger errors (MSE > 90). These findings support the use of interpretable statistical mixed-effects models for small longitudinal telemonitoring studies and suggest that larger and more diverse cohorts are needed before highly flexible neural mixed-effects models can be reliably assessed in this application.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) for modeling Parkinson's disease progression from longitudinal voice biomarkers in the Oxford Parkinson's telemonitoring dataset (N=42 subjects). It reports that GAMMs achieve the lowest prediction error (MSE 6.56) under a longitudinal prediction task while neural approaches overfit severely (MSE >90), concluding that interpretable statistical mixed-effects models are preferable for small-sample telemonitoring studies.

Significance. If the evaluation protocols are shown to be consistent, the work offers useful empirical evidence that flexible neural mixed-effects models can underperform relative to GAMMs in small longitudinal biomedical datasets due to overfitting, while retaining the value of subject-level structure and smooth effects. The direct comparison on a named public dataset strengthens the practical takeaway for Parkinson's telemonitoring applications.

major comments (2)

[Abstract] The claim that all models are evaluated 'under the same longitudinal prediction setting' is not supported by any description of the subject-specific random-effect prediction step for the neural baselines (NME and GNMMs). In a mixed-effects context with N=42, fair MSE comparison requires that each model uses an equivalent procedure (e.g., posterior means, empirical Bayes, or neural approximation) for held-out subject-level effects; absent this detail, the reported gap (GAMM MSE 6.56 vs. neural MSE >90) may reflect inconsistent evaluation rather than model superiority.
[Results] Results section (or equivalent): No information is given on the data-split strategy, cross-validation scheme, or uncertainty quantification (error bars) around the reported MSE values. With only 42 subjects, these omissions make it impossible to assess whether the overfitting interpretation for neural models is robust or an artifact of a particular train/test partition.
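The "equivalent procedure" requested in the first major comment can be illustrated with the simplest case: a Gaussian random-intercept model, where the posterior-mean (empirical Bayes) subject effect is a shrunken average of that subject's residuals. The sketch below is an illustration under that assumption, not the paper's procedure; the function names and the treatment of the variance components as known are hypothetical.

```python
import numpy as np

def shrunken_intercept(residuals, var_subject, var_noise):
    """Posterior-mean random intercept for one subject (Gaussian random-intercept model).

    residuals: observed values minus population-level predictions for that subject.
    var_subject, var_noise: variance components, assumed known here for illustration.
    """
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    if n == 0:
        return 0.0  # no observations: fall back to the population-level prediction
    # Shrinkage weight grows toward 1 as the subject contributes more observations.
    weight = n * var_subject / (n * var_subject + var_noise)
    return float(weight * residuals.mean())

def predict_with_subject_effect(pop_pred_new, pop_pred_seen, y_seen,
                                var_subject, var_noise):
    """Add the estimated subject intercept to population-level predictions."""
    resid = np.asarray(y_seen, dtype=float) - np.asarray(pop_pred_seen, dtype=float)
    b_hat = shrunken_intercept(resid, var_subject, var_noise)
    return np.asarray(pop_pred_new, dtype=float) + b_hat
```

Because the same function accepts population-level predictions from any model, a GAMM and a neural network could in principle share this step, which is one way to keep the subject-level adjustment comparable across model classes.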

minor comments (2)

[Abstract] The abstract and methods would benefit from explicit notation distinguishing fixed effects, random effects, and the exact loss used for each model class.
[Figures] Figure captions should clarify whether plotted trajectories include subject-specific random effects or only population-level predictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of methodological transparency that will improve the clarity and reproducibility of the work. We address each major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses

Referee: [Abstract] The claim that all models are evaluated 'under the same longitudinal prediction setting' is not supported by any description of the subject-specific random-effect prediction step for the neural baselines (NME and GNMMs). In a mixed-effects context with N=42, fair MSE comparison requires that each model uses an equivalent procedure (e.g., posterior means, empirical Bayes, or neural approximation) for held-out subject-level effects; absent this detail, the reported gap (GAMM MSE 6.56 vs. neural MSE >90) may reflect inconsistent evaluation rather than model superiority.

Authors: We acknowledge that the manuscript does not explicitly describe the subject-specific random-effect prediction procedure for the neural models. The longitudinal prediction task was designed to be consistent across all approaches: for NME and GNMMs, subject-specific effects were obtained via an empirical Bayes-style optimization that fixes the population-level parameters and estimates subject-level adjustments on the training observations for each held-out subject, mirroring the posterior-mean prediction used for GAMMs. To resolve the ambiguity, we will add a dedicated paragraph in the Methods section detailing this procedure for each model class and confirming that the same held-out subject prediction protocol was applied uniformly. This revision will demonstrate that the performance differences reflect model characteristics rather than evaluation inconsistencies.

Revision: yes

Referee: [Results] No information is given on the data-split strategy, cross-validation scheme, or uncertainty quantification (error bars) around the reported MSE values. With only 42 subjects, these omissions make it impossible to assess whether the overfitting interpretation for neural models is robust or an artifact of a particular train/test partition.

Authors: We agree that explicit documentation of the splitting and validation strategy is essential given the small cohort size. The study employed subject-wise leave-one-subject-out cross-validation, holding out all observations from each test subject to preserve the longitudinal dependence structure. We will revise the Methods and Results sections to state this protocol clearly. For uncertainty quantification, we will add error bars by reporting the standard deviation of MSE across repeated runs with different random seeds for model initialization and optimization; if space permits, we will also include results from a supplementary 5-fold subject-wise CV. These changes will allow readers to evaluate the robustness of the overfitting conclusion for the neural models.

Revision: yes
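The leave-one-subject-out protocol described in this response can be sketched as follows. This is a minimal illustration, not the authors' code: `leave_one_subject_out`, `loso_mse`, and the `fit`/`predict` callables are hypothetical names, and the spread reported here is across held-out subjects rather than across random seeds.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        # All rows of the held-out subject go to the test fold, none to training.
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def loso_mse(X, y, subject_ids, fit, predict):
    """Per-subject MSE plus mean and standard deviation across held-out subjects.

    fit(X_train, y_train) -> model and predict(model, X_test) -> predictions
    are placeholders for whichever model class is being evaluated.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    fold_mse = {}
    for subj, train_idx, test_idx in leave_one_subject_out(subject_ids):
        model = fit(X[train_idx], y[train_idx])
        err = y[test_idx] - predict(model, X[test_idx])
        fold_mse[subj] = float(np.mean(err ** 2))
    values = np.array(list(fold_mse.values()))
    return fold_mse, float(values.mean()), float(values.std(ddof=1))
```

Passing the same `subject_ids` and the same splitting function to every model class is what makes the resulting MSE values comparable; only `fit` and `predict` change between a GAMM and a neural baseline.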

Circularity Check

0 steps flagged

No circularity: direct empirical comparison on external dataset

Full rationale

The paper reports an empirical head-to-head evaluation of NME, GNMM, and GAMM models on the publicly available Oxford Parkinson's telemonitoring dataset (N=42 subjects). The headline result (GAMM MSE 6.56 versus neural MSE >90) is obtained by fitting each model class to longitudinal voice data and computing out-of-sample prediction error under a stated common longitudinal prediction protocol. No equation or result is defined in terms of itself, no fitted parameter is relabeled as an independent prediction, and no load-bearing premise rests on a self-citation chain. The comparison therefore remains an external, falsifiable measurement rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, novel axioms, or invented entities are described beyond standard longitudinal mixed-model assumptions.

axioms (1)

Domain assumption: Repeated measurements from the same subject are correlated and disease trajectories vary substantially across individuals.
Invoked in the abstract to justify use of mixed-effects models for the voice biomarker data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We compare Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) under the same longitudinal prediction setting... GAMM MSE 6.56 vs neural MSE >90"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage:

$$\log(\mathrm{UPDRS}_{ij}) = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{HNR}_{ij} + f(\mathrm{test\_time}_{ij}) + b_{0i} + b_{1i}\,\mathrm{test\_time}_{ij} + \varepsilon_{ij}$$
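The GAMM equation quoted above can be read term by term as a recipe for one subject's predicted log-UPDRS trajectory. The sketch below evaluates the right-hand side without the noise term; the function name, the argument layout, and any coefficient values passed to it are hypothetical, and the smooth term is supplied as an arbitrary callable rather than the paper's fitted spline.

```python
import numpy as np

def gamm_log_updrs(age, hnr, test_time, beta, smooth, b0, b1):
    """Conditional mean of log(UPDRS) for one subject, omitting the noise term.

    beta: (beta0, beta1, beta2) fixed-effect coefficients.
    smooth: callable standing in for the smooth term f(test_time).
    b0, b1: that subject's random intercept and random slope on test_time.
    """
    beta0, beta1, beta2 = beta
    t = np.asarray(test_time, dtype=float)
    h = np.asarray(hnr, dtype=float)
    population = beta0 + beta1 * age + beta2 * h + smooth(t)  # shared by all subjects
    subject = b0 + b1 * t                                     # subject-specific deviation
    return population + subject
```

Applying `np.exp` to the result maps the prediction back to the UPDRS scale; under a log-normal reading of the model this back-transform targets the conditional median rather than the mean, which matters when comparing MSE values across models fitted on different scales.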

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1] Ananthanarayanan, A., Senivarapu, S., & Murari, A. (2025). Towards causal interpretability in deep learning for Parkinson's detection from voice data. medRxiv, 2025.04.25.25326311
[2] Arora, S., Vetek, E. V., Hargrave, Z. B., et al. (2015). Detecting and monitoring the symptoms of Parkinson's disease using smartphones: a pilot study. Parkinsonism & Related Disorders, 21(6):650–653
[3] Bloem, B. R., Post, M. R., & Dorsey, R. (2021). The expanding burden of Parkinson's disease. Journal of Parkinson's Disease, 11(2):403–413
[4] Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421):9–25
[5] Del Din, S., Godfrey, A., & Rochester, L. (2016). Free-living gait characteristics in ageing and Parkinson's disease: impact of environment and ambulatory bout length. Journal of NeuroEngineering and Rehabilitation, 13:46
[6] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22
[7] Dorsey, R., Bloem, B. R., et al. (2018). Global, regional, and national burden of Parkinson's disease, 1990–2016. The Lancet Neurology, 17(11):939–953
[8] Drotar, P., Mekyska, M., & Ruzicka, I. (2016). Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinson's disease. Artificial Intelligence in Medicine, 67:39–46
[9] Eskidere, Ö., Ertaş, F., & Hanilçi, C. (2012). A comparison of regression methods for remote tracking of Parkinson's disease progression. Expert Systems with Applications, 39(5):5523–5528
[10] Fahn, S., Elton, R. L., & Members of the UPDRS Development Committee (1987). Unified Parkinson's disease rating scale. In S. Fahn, C. D. Marsden, D. B. Calne, & M. Goldstein (Eds.), Recent Developments in Parkinson's Disease, vol. 2, pp. 153–163. Macmillan Healthcare Information
[11] Gilmour, A. R., Thompson, R., & Cullis, B. R. (1995). Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51(4):1440–1450
[12] Goetz, C. G., Nguyen, S. T., et al. (2008). Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Movement Disorders, 23(15):2129–2170
[13] Laird, N. M. & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974
[14] Lin, X. & Zhang, D. (1999). Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society: Series B, 61(2):381–400
[15] Lindstrom, M. J. & Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46:673–687
[16] Maity, T. K. & Pal, A. K. (2013). Subject-specific treatment to neural networks for repeated measures analysis. In Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, pp. 60–65
[17] Mandel, F., Ghosh, R. P., & Barnett, I. (2023). Neural networks for clustered and longitudinal data using mixed effects models. Biometrics, 79(2):711–721
[18] Nilashi, M., Ibrahim, O., & Ahani, A. (2016). Accuracy improvement for predicting Parkinson's disease progression. Scientific Reports, 6
[19] Patterson, H. D. & Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545–554
[20] UCI Machine Learning Repository (2012). Parkinson's disease telemonitoring data set. https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data
[21] Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press
[22] Tong, R., Xu, T., Ju, X., & Wang, L. (2025). Progress in medical AI: Reviewing large language models and multimodal systems for diagnosis. AI Med, 1(1):5
[23] Tsanas, A., Little, M. A., McSharry, P. E., & Ramig, L. O. (2012). Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests. Journal of the Royal Society Interface, 9(75):1905–1912
[24] Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B, 73(1):3–36
[25] Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman & Hall/CRC
[26] Wörtwein, T., Allen, N. B., Sheeber, L. B., Auerbach, R. P., Cohn, J. F., & Morency, L.-P. (2023). Neural mixed effects for nonlinear personalized predictions. In Proceedings of the 2023 International Conference on Multimodal Interaction (ICMI '23), pp. 445–454. ACM
[27] Xiong, Y., Kim, H. J., & Singh, V. (2019). Mixed effects neural networks (MeNets) with applications to gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7743–7752
