arxiv: 2606.10088 · v1 · pith:PHD54KQQnew · submitted 2026-06-08 · 💻 cs.CV

Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification

Pith reviewed 2026-06-27 16:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords Parkinson's disease · facial motion analysis · hypomimia · video classification · temporal descriptors · YouTubePD benchmark · Random Forest · interpretability

The pith

Normalized velocity descriptors from 14 facial regions classify in-the-wild Parkinson's videos at 0.826 balanced accuracy using a Random Forest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether geometric motion features extracted from facial keypoints can distinguish Parkinson's disease videos from non-PD videos on the YouTubePD benchmark. It compares five descriptor families under a fixed binary classification setup and finds that normalized velocity features paired with a Random Forest yield the highest and most stable performance. A sympathetic reader would care because the approach is lightweight, uses only 2D keypoints, and supplies region-level importance scores that link back to the clinical sign of hypomimia without requiring clinical equipment or controlled recording conditions.

Core claim

Normalized velocity descriptors computed over 14 predefined facial regions, when fed to a Random Forest, reach 0.826 balanced accuracy and 0.855 AUROC on the held-out YouTubePD test split. The same representation remains stable across ten random seeds (0.810 ± 0.018 balanced accuracy, 0.855 ± 0.005 AUROC). Static geometry, normalized geometry, un-normalized velocity, relative velocity, and a GRU sequence model all underperform this combination under an identical protocol. Region ablation and permutation importance further show that the method is interpretable at the level of individual facial areas.
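For readers less familiar with the two headline metrics, the following is a minimal sketch of how balanced accuracy and AUROC are defined for a binary task. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (1 = PD, 0 = non-PD).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

bal_acc = balanced_accuracy(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"balanced accuracy = {bal_acc:.3f}, AUROC = {auc:.3f}")
```

Balanced accuracy averages the per-class recalls, so it is not inflated by class imbalance, while AUROC is threshold-free and depends only on how the scores rank the two classes.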

What carries the argument

Normalized velocity descriptors: per-region Euclidean displacements between consecutive frames, scaled by the inter-ocular distance of that frame, aggregated over time and used as input features to a Random Forest classifier.
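The description above can be sketched in a few lines. This is an illustrative reading, not the author's implementation: the function name, the keypoint layout, the choice to average displacement over a region's keypoints, and the three temporal statistics (mean, standard deviation, maximum) are all assumptions.

```python
import math
from statistics import mean, pstdev


def normalized_velocity_features(frames, regions, left_eye, right_eye):
    """Hypothetical sketch of per-region normalized velocity descriptors.

    frames:   list of frames, each a list of (x, y) keypoints
    regions:  dict mapping a region name to its keypoint indices
    left_eye, right_eye: indices of the keypoints used for inter-ocular distance
    Returns a dict: region name -> (mean, std, max) of normalized displacement.
    """
    features = {}
    for name, idxs in regions.items():
        speeds = []
        for prev, cur in zip(frames, frames[1:]):
            # Scale by the inter-ocular distance of the current frame.
            scale = math.dist(cur[left_eye], cur[right_eye])
            if scale == 0:
                continue  # skip degenerate frames (assumed handling)
            # Mean Euclidean displacement of the region's keypoints.
            disp = mean([math.dist(prev[i], cur[i]) for i in idxs])
            speeds.append(disp / scale)
        if speeds:
            features[name] = (mean(speeds), pstdev(speeds), max(speeds))
        else:
            features[name] = (0.0, 0.0, 0.0)
    return features


# Toy clip: keypoints 0 and 1 are the eyes, 2 and 3 form a "mouth" region.
frames = [
    [(0.0, 0.0), (2.0, 0.0), (0.0, -2.0), (2.0, -2.0)],
    [(0.0, 0.0), (2.0, 0.0), (0.0, -3.0), (2.0, -3.0)],
    [(0.0, 0.0), (2.0, 0.0), (0.0, -6.0), (2.0, -6.0)],
]
regions = {"eyes": [0, 1], "mouth": [2, 3]}
feats = normalized_velocity_features(frames, regions, left_eye=0, right_eye=1)

# The same clip at twice the image scale should give the same descriptors.
zoomed = [[(2 * x, 2 * y) for x, y in frame] for frame in frames]
feats_zoomed = normalized_velocity_features(zoomed, regions, left_eye=0, right_eye=1)
```

Dividing by inter-ocular distance is what makes the features comparable across videos in which the face appears at different sizes, which is why the zoomed clip yields identical values in this toy case.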

If this is right

  • The representation is stable enough that performance does not depend on a single random seed.
  • Ablation shows that performance drops when any of the 14 regions is removed, indicating distributed rather than single-region information.
  • Permutation importance ranks regions consistently, supplying an explicit map from motion statistics to classification decisions.
  • The same descriptors outperform the recurrent (GRU) baseline while staying fully interpretable through inspection of feature importances.
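The permutation-importance idea mentioned in the list above can be illustrated with a small sketch. Everything here is hypothetical: the stand-in threshold rule replaces the paper's Random Forest, the feature columns and group names are invented, and the groups could equally be temporal statistics rather than regions.

```python
import random


def balanced_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def grouped_permutation_importance(predict, X, y, groups, n_repeats=20, seed=0):
    """Mean drop in balanced accuracy when a group's columns are shuffled jointly."""
    rng = random.Random(seed)
    base = balanced_accuracy(y, predict(X))
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            X_perm = [row[:] for row in X]
            # Move the whole group of columns together so that
            # within-group structure is preserved.
            for i, j in enumerate(order):
                for c in cols:
                    X_perm[i][c] = X[j][c]
            drops.append(base - balanced_accuracy(y, predict(X_perm)))
        importances[name] = sum(drops) / n_repeats
    return importances


# Toy features: column 0 = mouth motion, columns 1-2 = brow motion (invented).
X = [
    [0.10, 0.5, 0.3], [0.20, 0.1, 0.9], [0.15, 0.8, 0.2], [0.30, 0.4, 0.4],
    [0.80, 0.6, 0.1], [0.90, 0.2, 0.7], [0.70, 0.9, 0.5], [0.60, 0.3, 0.8],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = PD (low motion in this toy example)


def predict(rows):
    # Stand-in classifier (assumption): uses mouth motion only.
    return [1 if row[0] < 0.5 else 0 for row in rows]


groups = {"mouth": [0], "brow": [1, 2]}
imp = grouped_permutation_importance(predict, X, y, groups)
```

Because the stand-in rule ignores the brow columns, shuffling them changes nothing and their importance is zero, whereas shuffling the mouth column degrades accuracy. A ranking of this kind is what links motion statistics to classification decisions.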

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the features are derived only from 2D keypoints, the pipeline could be re-run on any existing video archive without new recordings.
  • If the same descriptors were computed on videos paired with MDS-UPDRS facial scores, a regression extension might test whether motion magnitude tracks clinical severity.
  • The seed-robustness result implies that future work can focus on dataset shift rather than hyper-parameter sensitivity when moving to new video sources.

Load-bearing premise

The YouTubePD videos constitute an unbiased and correctly labeled sample of real-world PD versus non-PD cases.

What would settle it

Retraining the identical normalized-velocity-plus-Random-Forest pipeline on a new dataset whose labels come from in-person neurological examination and observing balanced accuracy fall below 0.70 would falsify the claim that the descriptors reliably separate the classes.

Figures

Figures reproduced from arXiv: 2606.10088 by Riyadh Almushrafy (Majmaah University, Saudi Arabia).

Figure 1: Example visualization of the 14 processed YouTubePD facial-region polygons. ...
Figure 2: Receiver operating characteristic curves for the strongest baseline configurations on ...
Figure 3: Baseline configurations shown in AUROC–F1 space. ...
Figure 4: Single-region ablation results on the YouTubePD binary classification task. ...
Figure 5: Grouped permutation importance by temporal statistic for the normalized velocity ...
Figure 6: Confusion matrix for the best-performing Normalized Velocity + Random Forest ...
Original abstract

Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 +/- 0.018 and AUROC of 0.855 +/- 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates temporal motion descriptors extracted from 14 predefined facial regions in videos for binary classification of Parkinson's disease (PD) versus non-PD on the YouTubePD benchmark. It compares static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU baseline, reporting that normalized velocity descriptors paired with a Random Forest classifier achieve the highest performance: balanced accuracy 0.826 and AUROC 0.855 on the held-out test split. The work includes seed-robustness checks (stable at 0.810 ± 0.018 balanced accuracy across 10 seeds), region-level ablation, and permutation importance analysis, framing the contribution as a lightweight, interpretable benchmark study without clinical diagnostic claims.

Significance. If the YouTubePD labels are reliable, the results demonstrate that normalized facial-region velocity features can support stable in-the-wild PD video classification with competitive metrics and built-in interpretability via ablation and permutation importance. The explicit seed-robustness analysis, region ablation, and permutation importance are strengths that increase confidence in the empirical findings and distinguish the work from purely black-box approaches.

major comments (1)
  1. Data / benchmark description (likely §3 or Methods): The central claim of 0.826 balanced accuracy / 0.855 AUROC on the held-out split rests on the assumption that YouTubePD provides correctly labeled, unbiased PD vs. non-PD samples. No details are given on label provenance (self-report vs. verified diagnosis), inter-rater checks, or mitigation of YouTube-specific selection bias and recording variability; without this, the numeric results cannot be interpreted as evidence for the descriptors' utility.
minor comments (2)
  1. Abstract: The phrase 'normalized velocity descriptors' is used without a brief parenthetical definition or reference to the exact computation (e.g., which keypoints and normalization), reducing immediate clarity for readers.
  2. Results section: The reported standard deviations across 10 seeds are given only for the best model; providing the same statistics for the other descriptor/classifier combinations would strengthen the comparative claims.
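The per-configuration seed statistics requested in the second minor comment could be tabulated along the following lines. This is a sketch only: the configuration names and score values are placeholders, not results from the paper, and the use of the sample standard deviation is an assumption, since the paper's convention is not stated here.

```python
from statistics import mean, stdev


def summarize(scores_by_config):
    """Return {config: (mean, sample std)} over per-seed scores."""
    return {name: (mean(vals), stdev(vals)) for name, vals in scores_by_config.items()}


def format_row(name, m, s):
    return f"{name}: {m:.3f} ± {s:.3f}"


# Placeholder per-seed balanced accuracies (invented values).
scores_by_config = {
    "config_a": [0.80, 0.82, 0.84],
    "config_b": [0.70, 0.75, 0.80],
}

summary = summarize(scores_by_config)
rows = [format_row(name, m, s) for name, (m, s) in summary.items()]
for row in rows:
    print(row)
```

Reporting the same mean ± standard deviation for every descriptor/classifier pair would let readers judge whether the gap to the best configuration exceeds seed-to-seed variation.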

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback and for acknowledging the strengths of our seed-robustness checks, region ablation, and permutation importance analysis. We address the single major comment below and will revise the manuscript to improve the benchmark description.

  1. Referee: Data / benchmark description (likely §3 or Methods): The central claim of 0.826 balanced accuracy / 0.855 AUROC on the held-out split rests on the assumption that YouTubePD provides correctly labeled, unbiased PD vs. non-PD samples. No details are given on label provenance (self-report vs. verified diagnosis), inter-rater checks, or mitigation of YouTube-specific selection bias and recording variability; without this, the numeric results cannot be interpreted as evidence for the descriptors' utility.

    Authors: We agree that the manuscript requires a clearer description of the YouTubePD benchmark to allow proper interpretation of the reported metrics. In the revised version we will add a dedicated subsection (likely in §3) that summarizes the benchmark construction as described in its original reference: labels derive from self-reported PD status in video titles/descriptions for the positive class and from control videos for the negative class. We will explicitly note the absence of clinical verification or inter-rater reliability metrics and acknowledge YouTube-specific selection and recording biases. This addition will frame the work strictly as a benchmark study on the given dataset. We cannot supply verified medical diagnoses or new inter-rater data, as these are outside the scope of the public benchmark and would require an entirely different data-collection protocol.

    Revision: yes

standing simulated objections not resolved
  • Absence of clinically verified diagnoses and inter-rater reliability statistics for YouTubePD labels, which are inherent limitations of the public benchmark and cannot be retroactively supplied by the present study.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results on held-out split

The paper performs an empirical comparison of geometric and velocity-based facial descriptors for binary PD classification on the YouTubePD benchmark using standard classifiers (Random Forest, GRU). Reported metrics (balanced accuracy 0.826, AUROC 0.855) are obtained directly from evaluation on an explicitly held-out test split, with seed-robustness and ablation checks. No derivation, uniqueness theorem, ansatz, or prediction is presented that reduces by construction to fitted inputs, self-citations, or renamed known results. The analysis is self-contained as standard ML benchmarking without load-bearing theoretical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central empirical claim rests on the assumption that the YouTubePD labels are reliable and that the 14 facial regions capture the relevant motion signal. No free parameters are explicitly fitted beyond standard classifier training; no new entities are postulated.

axioms (2)
  • domain assumption Facial keypoint detection is sufficiently accurate on in-the-wild YouTube videos to support velocity computation.
    The pipeline presupposes reliable extraction of the 14 regions; any systematic failure of the keypoint detector would invalidate all motion descriptors.
  • domain assumption The binary PD/non-PD labels in YouTubePD are treated as ground truth without reported inter-rater reliability or clinical confirmation.
    The classification protocol depends on these labels being correct; the abstract provides no evidence on label quality.
