Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

arXiv: 2604.21706 · v1 · submitted 2026-04-23 · cs.CL

Bernard Muller, Antonio Armando Ortiz Barrañón, LaVonne Roberts

Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3

keywords dysarthria · phonological subspaces · self-supervised speech representations · aetiology-specific profiles · cross-lingual stability · d-prime separability · speech motor disorders · Parkinson's disease

The pith

Dysarthria from different causes produces distinct phonological subspace collapse patterns that keep the same shape across languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper scales a training-free method for measuring how dysarthria collapses phonological feature subspaces in frozen self-supervised speech models from hundreds to 3,374 speakers across 12 languages and five causes. It finds that the resulting degradation profiles separate the causes at the group level for most features and that the relative shape of each profile stays highly consistent no matter which language the speakers use. The same patterns emerge from six different model backbones and survive when token counts are fixed, showing the signal is not an artifact of how much data is used. If these results hold, speech samples alone could reveal the underlying cause of impaired speech production in a language-independent way, though absolute severity scores would still need local calibration within each dataset.

Core claim

Aetiology-specific degradation profiles are distinguishable at the group level with 10 of 13 features yielding large effect sizes (epsilon-squared > 0.14) and Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across languages for each aetiology, while all six SSL backbones produce monotonic severity gradients with inter-model agreement above rho = 0.77 and fixed-token estimation preserves the severity correlation.
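
The shape-versus-magnitude distinction in the cosine result can be illustrated with a minimal sketch. The profile values and their feature ordering below are invented for illustration and are not taken from the paper; only the use of cosine similarity on 5-dimensional consonant d-prime profiles comes from the source.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical 5-dimensional consonant d-prime profiles (illustrative values only).
profile_lang_a = [2.0, 1.5, 1.0, 0.8, 0.5]
# Same shape with uniformly smaller magnitudes, e.g. a noisier corpus.
profile_lang_b = [0.6 * x for x in profile_lang_a]
# A different shape: the relative ordering of the features is reversed.
profile_other = list(reversed(profile_lang_a))

same_shape = cosine_similarity(profile_lang_a, profile_lang_b)
different_shape = cosine_similarity(profile_lang_a, profile_other)
```

Because cosine similarity ignores overall scale, a value near 1 is compatible with absolute d-prime magnitudes that differ between corpora, which is why the paper pairs the shape-stability claim with a calibration caveat.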

What carries the argument

d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, which measures how much each aetiology reduces the model's ability to distinguish phonological classes such as consonants and vowels.
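
As a rough illustration of the quantity involved, the sketch below computes a d-prime between two phonological classes from frame-level embedding vectors. It assumes one particular construction (project onto the difference of class means, then divide the mean gap by the root-mean-square of the two class standard deviations); the paper's exact subspace definition is not given on this page, so the function names and the projection step are assumptions.

```python
import math
from statistics import mean, variance


def project(vectors, direction):
    """Scalar projection of each vector onto a (non-zero) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [sum(v_i * d_i for v_i, d_i in zip(v, direction)) / norm for v in vectors]


def d_prime(class_a, class_b):
    """Separability of two classes along the axis joining their means.

    Assumption: d' = |mu_a - mu_b| / sqrt((var_a + var_b) / 2), computed on
    1-D projections. Each class needs at least two vectors.
    """
    dim = len(class_a[0])
    mean_a = [mean(v[i] for v in class_a) for i in range(dim)]
    mean_b = [mean(v[i] for v in class_b) for i in range(dim)]
    direction = [a - b for a, b in zip(mean_a, mean_b)]
    if all(d == 0 for d in direction):
        return 0.0  # identical class means: no separability along any axis
    proj_a = project(class_a, direction)
    proj_b = project(class_b, direction)
    pooled_sd = math.sqrt((variance(proj_a) + variance(proj_b)) / 2)
    if pooled_sd == 0:
        return math.inf  # no within-class spread along the axis
    return abs(mean(proj_a) - mean(proj_b)) / pooled_sd
```

Under this reading, "subspace collapse" corresponds to the two class clouds moving closer together or spreading out, either of which lowers d-prime.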

If this is right

Supports a training-free, architecture-independent framework for aetiology-aware dysarthria characterisation.
Enables language-independent phenotyping of degradation patterns with within-corpus calibration needed for absolute severity.
Group-level distinction works for most phonological features while individual-level classification stays limited at 22.6 percent macro F1.
Cross-backbone agreement above rho = 0.77 and preserved correlations at fixed token counts confirm the signal is robust and not a token-count artefact.
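
The fixed-token control can be sketched as follows: draw the same number of tokens per class for every speaker before estimating d-prime, so that speakers with more data do not receive systematically different estimates. The 200-token default mirrors the figure quoted in the abstract; the one-dimensional d-prime, the seeded sampling, and the choice to raise an error when a class has too few tokens are assumptions made for this example.

```python
import math
import random
from statistics import mean, variance


def d_prime_1d(scores_a, scores_b):
    """d' on scalar scores: mean gap over the RMS of the two class SDs."""
    pooled_sd = math.sqrt((variance(scores_a) + variance(scores_b)) / 2)
    return abs(mean(scores_a) - mean(scores_b)) / pooled_sd


def fixed_token_d_prime(scores_a, scores_b, tokens_per_class=200, seed=0):
    """Estimate d' from exactly `tokens_per_class` tokens drawn from each class."""
    if min(len(scores_a), len(scores_b)) < tokens_per_class:
        # Assumption: speakers with too few tokens are handled by the caller.
        raise ValueError("not enough tokens in at least one class")
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    sample_a = rng.sample(scores_a, tokens_per_class)
    sample_b = rng.sample(scores_b, tokens_per_class)
    return d_prime_1d(sample_a, sample_b)
```

If the correlation with severity survives this equalisation, as the paper reports, the signal cannot be explained by token count alone.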

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automated tools could screen for likely aetiology from speech samples alone in multilingual clinical settings.
The high cross-lingual shape stability suggests phonological subspaces break down in ways that are fundamental to speech motor control rather than language-specific.
Longitudinal tracking of the same speakers could reveal how subspace collapse progresses with disease stage.
Calibration methods that align absolute d-prime values across datasets would allow severity comparisons between studies and languages.

Load-bearing premise

The measured d-prime differences truly reflect aetiology-driven phonological subspace collapse rather than recording conditions, speaker demographics or dataset-specific artifacts.

What would settle it

Re-analysis of d-prime profiles after matching speakers across aetiologies for age, sex, recording quality and language; if the group differences disappear, the claim that profiles are aetiology-specific would be falsified.
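
One simple form such a matched re-analysis could take is exact matching on coarse strata followed by a fresh effect-size calculation. The sketch below matches on language, sex and age decade only; the record fields (`language`, `sex`, `age`, `dprime`) are hypothetical, and recording quality is left out because no such covariate is described on this page.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d with the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = math.sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled


def stratum(speaker, age_bin=10):
    """Matching key: language, sex and age band (hypothetical field names)."""
    return (speaker["language"], speaker["sex"], speaker["age"] // age_bin)


def match_groups(group_a, group_b):
    """Keep equal numbers of speakers from each group within every shared stratum."""
    by_a, by_b = defaultdict(list), defaultdict(list)
    for s in group_a:
        by_a[stratum(s)].append(s)
    for s in group_b:
        by_b[stratum(s)].append(s)
    matched_a, matched_b = [], []
    for key in by_a.keys() & by_b.keys():
        k = min(len(by_a[key]), len(by_b[key]))
        matched_a.extend(by_a[key][:k])
        matched_b.extend(by_b[key][:k])
    return matched_a, matched_b
```

Comparing `cohens_d` on the `dprime` values before and after `match_groups` would show how much of a group difference survives the matching; a large drop would point toward demographic or corpus confounds.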

Figures

Figures reproduced from arXiv: 2604.21706 by Antonio Armando Ortiz Barrañón, Bernard Muller, LaVonne Roberts.

**Figure 1.** Deviation from healthy controls by aetiology (Cohen’s d). Grey cells indicate missing data (insufficient speakers for effect size computation; Stroke × vowel triangle area has fewer than 5 speakers with valid estimates). Rows show 13 phonological and prosodic features; columns show 5 dysarthric aetiologies. Darker red indicates greater degradation from HC baseline. To rule out the possibility that aetiolog…

**Figure 2.** Pairwise aetiology comparison (HC-normalised to 1.0). Each panel shows two aetiologies against the HC reference (light blue). Distinct shapes reflect aetiology-specific degradation patterns. 4.2 Cross-lingual profile-shape stability: A central question for clinical deployment is whether phonological degradation profiles are language-specific or consistent across languages. If a PD patient in Slovakia sh…

**Figure 3.** HC-normalised Parkinson’s disease profiles across 6 languages (those with n>=3 PD speakers). Each bar shows the ratio of PD mean to language-specific HC mean for 9 d-prime features (1.0 = healthy). The parallel pattern across languages demonstrates cross-lingual consistency of the PD phonological profile. As an anecdotal observation, the single Swahili PD speaker (n = 1) has a consonant d-prime profile wit…

**Figure 4.** Severity gradient across 6 SSL backbones. Error bars show 95% bootstrap confidence intervals (1,000 resamples over speakers). The smallest mild–moderate margin (XLS-R, 0.004) is not significant; all other adjacent-severity differences exceed the bootstrap CI width. All models show monotonic decrease from control to severe. Absolute d-prime magnitudes differ by model architecture, but the gradient direction…

**Figure 5.** Inter-model agreement on per-speaker composite consonant d-prime (Spearman rho).
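
The error bars described for Figure 4 (95% bootstrap confidence intervals, 1,000 resamples over speakers) can be approximated with a percentile bootstrap of the group mean. This is a minimal sketch: whether the paper uses the percentile method or another variant is not stated here, and the function name and seed are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Each resample draws len(values) speakers with replacement.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]
```

Resampling speakers rather than tokens keeps each speaker's data together, so the interval reflects between-speaker variability.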

Original abstract

We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
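
For readers unfamiliar with the epsilon-squared threshold quoted above, the sketch below shows one common way to obtain it: compute the Kruskal-Wallis H statistic across groups and divide by n - 1. Treating the paper's statistic as this rank-based estimator is an assumption, and the sketch omits the tie correction to H for brevity.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def epsilon_squared(groups):
    """Rank-based effect size: H / (n - 1), ranging from 0 to 1."""
    n = sum(len(g) for g in groups)
    return kruskal_h(groups) / (n - 1)
```

With three fully separated groups such as `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` the estimate is 0.9, well above the 0.14 cut-off the paper uses for a large effect, while interleaved groups fall below it.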

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper scales prior d-prime subspace work to 3,374 speakers and finds aetiology-specific profiles that hold shape across languages and models, but the pooled design leaves dataset confounds as a live alternative explanation.

The main takeaway is that aetiology-specific degradation patterns in frozen SSL features appear at the group level and keep their relative shape across languages, with solid effect sizes and high inter-model agreement. This is a straightforward scaling of the authors' earlier smaller study, now covering five aetiologies, twelve languages, and six backbones on a much bigger sample. They report concrete numbers: ten of thirteen features show large epsilon-squared values, Parkinson's separates from the articulatory group at d=0.83, cosine similarity on consonant profiles exceeds 0.95 across languages, and all models give monotonic severity gradients with rho above 0.77. The token-count check and fixed-token robustness test are useful controls that strengthen the claim the signal is not an artifact of how many tokens are available. Those parts are done cleanly and deserve credit for the empirical care shown in the numbers they publish.

The soft spot is the lack of any reported within-dataset replication or regression on recording covariates. With twenty-five datasets that differ in microphones, noise, sampling rates, and speaker demographics, the between-aetiology d-prime differences could partly reflect those corpus-level factors rather than phonological subspace collapse per se. The cross-lingual profile stability is consistent with either a stable phonological effect or a stable confound pattern, and the abstract already notes that absolute magnitudes are not calibrated across languages. Individual classification at 22.6% macro F1 also shows the signal is group-level only.

This work is aimed at computational speech researchers and clinical phoneticians who want training-free phenotyping tools. It is worth sending to peer review because the scale and the reported statistics are substantial enough to test, even if the methods section will need close checking on dataset matching and potential confounds.

Referee Report

2 major / 2 minor

Summary. The paper scales a training-free d-prime separability analysis of phonological feature subspaces in frozen SSL speech representations to 3,374 speakers across 25 datasets, 12 languages, and 5 dysarthria aetiologies (PD, CP, ALS, DS, stroke) plus controls. It reports three main results: (1) aetiology-specific degradation profiles are group-distinguishable (10/13 features with ε² > 0.14, Holm p < 0.001; PD vs. articulatory-execution group d = 0.83), though individual classification is limited (22.6% macro F1); (2) consonant d-prime profile shapes are cross-lingually stable (cosine > 0.95) while absolute magnitudes are not; (3) the pattern is robust across 6 SSL backbones (inter-model ρ > 0.77) and preserved under fixed-token estimation.

Significance. If the attribution to aetiology holds after confound controls, the work supplies a scalable, training-free, architecture-independent phenotyping tool for dysarthria that distinguishes degradation patterns by cause and does so consistently across languages, with direct clinical relevance for severity assessment and subgrouping. The large multi-dataset, multi-language sample and explicit reporting of effect sizes, p-values, and inter-model agreement are strengths.

major comments (2)

[Methods/Results (pooled analysis)] Methods and Results sections: the pooled analysis across 25 heterogeneous datasets attributes between-aetiology d-prime differences directly to disease without reported dataset-level matching, regression on recording-quality covariates (microphone, sampling rate, noise), or within-dataset replication of the aetiology contrast. Because SSL embeddings are sensitive to corpus-level statistics, the reported ε² > 0.14 and d = 0.83 could partly reflect stable dataset artifacts rather than phonological subspace collapse; the cross-lingual cosine > 0.95 is equally consistent with a stable confound pattern.
[Results (classification performance)] Results, first finding: individual-level classification performance is reported as only 22.6% macro F1 despite large group-level effect sizes; this gap is acknowledged but not quantified with respect to how much of the group separability survives after speaker-level demographic or recording controls, weakening the claim that the profiles are aetiology-specific at a clinically usable level.

minor comments (2)

[Abstract] Abstract: the claim of 'architecture-independent' results should be qualified by the specific 6 backbones tested rather than left unqualified.
[Discussion] The paper notes that absolute d-prime magnitudes require within-corpus calibration; this limitation should be stated more prominently when discussing clinical translation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting important methodological concerns regarding potential confounds in our pooled analysis and the interpretation of our classification results. We provide point-by-point responses below and indicate where revisions will be made to address these issues.

Referee: Methods and Results sections: the pooled analysis across 25 heterogeneous datasets attributes between-aetiology d-prime differences directly to disease without reported dataset-level matching, regression on recording-quality covariates (microphone, sampling rate, noise), or within-dataset replication of the aetiology contrast. Because SSL embeddings are sensitive to corpus-level statistics, the reported ε² > 0.14 and d = 0.83 could partly reflect stable dataset artifacts rather than phonological subspace collapse; the cross-lingual cosine > 0.95 is equally consistent with a stable confound pattern.

Authors: We agree that the absence of explicit controls for dataset-level factors such as recording quality is a limitation of the current pooled analysis. Although we have replication across multiple datasets for several aetiologies and observe high cross-lingual profile stability, this does not fully rule out stable confounds. In the revised manuscript, we will include additional analyses: (1) regression of d-prime values on available recording metadata where reported across datasets, and (2) within-dataset effect size calculations for aetiologies represented in multiple corpora. These will be reported in a new supplementary section to better isolate aetiology-specific effects.

revision: partial

Referee: Results, first finding: individual-level classification performance is reported as only 22.6% macro F1 despite large group-level effect sizes; this gap is acknowledged but not quantified with respect to how much of the group separability survives after speaker-level demographic or recording controls, weakening the claim that the profiles are aetiology-specific at a clinically usable level.

Authors: We acknowledge that the modest individual classification performance (22.6% macro F1) limits clinical applicability at the single-speaker level, and we have not yet quantified the robustness of group separability after speaker-level controls. In revision, we will add speaker-level analyses using mixed-effects models to evaluate the unique variance explained by aetiology after accounting for demographics (age, sex) and dataset indicators. This will provide a clearer assessment of the aetiology-specific signal at both group and individual levels.

revision: yes
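
The within-dataset effect size calculation promised in the first response could look roughly like the sketch below, which computes Cohen's d for one contrast separately inside each corpus that contains enough speakers from both groups. The `(dataset, aetiology, value)` row layout, the corpus names used in any example data, and the `min_speakers` threshold are assumptions for illustration; the mixed-effects analysis from the second response is not covered here.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d with the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = math.sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled


def within_dataset_effects(rows, group_a, group_b, min_speakers=3):
    """Cohen's d for group_a vs group_b, computed separately per dataset.

    `rows` is an iterable of (dataset, aetiology, value) tuples. Datasets
    lacking enough speakers in either group are skipped.
    """
    by_dataset = defaultdict(lambda: defaultdict(list))
    for dataset, aetiology, value in rows:
        by_dataset[dataset][aetiology].append(value)
    effects = {}
    for dataset, groups in by_dataset.items():
        a, b = groups.get(group_a, []), groups.get(group_b, [])
        if len(a) >= min_speakers and len(b) >= min_speakers:
            effects[dataset] = cohens_d(a, b)
    return effects
```

Because each estimate is computed inside a single corpus, microphone and recording differences between corpora cannot contribute to it, which is the point of the referee's request.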

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on new data

The paper applies a previously introduced d-prime separability method to a new collection of 3,374 speakers across 25 datasets. All reported quantities (epsilon-squared effect sizes, Cohen's d, cosine similarities of profile shapes, Spearman correlations) are computed directly from the frozen SSL embeddings and group labels in the current data. No equations redefine the target profiles in terms of themselves, no fitted parameters are relabeled as predictions, and the self-citation to the prior method paper is not load-bearing for the distinguishability or stability claims. The derivation chain consists of standard statistical comparisons on independent samples and therefore contains no reduction by construction.
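
As an example of how directly such quantities follow from the data, the inter-model agreement statistic (Spearman rho) is simply a Pearson correlation computed on ranks. The sketch below is a plain-Python version; the per-speaker scores in any example are hypothetical and the function names are chosen for this illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Two backbones that order speakers identically give rho = 1 even when their absolute d-prime scales differ, which matches the paper's observation that magnitudes vary by architecture while rankings agree.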

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical scaling study with no mathematical derivations; relies on standard statistical tests and pre-trained SSL models whose internal representations are treated as given.

Reference graph

Works this paper leans on

47 extracted references · 34 canonical work pages · 3 internal anchors

[1]

Duffy, J. R. (2019). Motor Speech Disorders: Substrates, Differential Diagnosis, and Management (4th ed.). Elsevier

2019
[2]

Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations

Muller, B., Ortiz Barranon, A. A., and Roberts, L. (2026). Training-free cross-lingual dysarthria severity assessment via phonological subspace analysis in self-supervised speech representations. arXiv preprint arXiv:2604.10123.doi:10.48550/arXiv.2604.10123

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10123 2026
[3]

J., Mortensen, D

Choi, K., Yeo, E., Cho, C. J., Mortensen, D. R., and Harwath, D. (2026). Self-supervised speech models encode phonetic context via position-dependent orthogonal subspaces. arXiv preprint arXiv:2603.12642

work page arXiv 2026
[4]

J., Wu, P., Mohamed, A., and Anumanchipalli, G

Cho, C. J., Wu, P., Mohamed, A., and Anumanchipalli, G. K. (2023). Evidence of vocal tract articulation in self-supervised learning of speech. In Proceedings of ICASSP 2023, 1–5.doi:10.1109/ICASSP4935 7.2023.10094711

work page doi:10.1109/icassp4935 2023
[5]

Towards scientificintelligence:Asurveyofllm-basedscientificagents

Halpern, B. M., Tienkamp, T., Abur, D., and Toda, T. (2026). PathBench: Speech intelligibility benchmark for automatic pathological speech assessment. arXiv preprint arXiv:2603.08097.doi:10.48550/arXiv .2603.08097

work page internal anchor Pith review doi:10.48550/arxiv 2026
[6]

K., Rusz, J., Magimai Doss, M., Orozco- Arroyave, J

Hernandez, A., Yeo, E., Choi, K., Li, C.-J., Yue, Z., Das, R. K., Rusz, J., Magimai Doss, M., Orozco- Arroyave, J. R., Arias-Vergara, T., Maier, A., Noth, E., Mortensen, D. R., Harwath, D., and Perez-Toro, P. A. (2026). Adapting self-supervised speech representations for cross-lingual dysarthria detection in Parkinson’s disease. arXiv:2603.22225. 21

work page arXiv 2026
[7]

D., Rusz, J., and Orozco-Arroyave, J

Rios-Urrego, C. D., Rusz, J., and Orozco-Arroyave, J. R. (2024). Automatic speech-based assessment to discriminate Parkinson’s disease from essential tremor with a cross-language approach. npj Digital Medicine, 7, 37.doi:10.1038/s41746-024-01027-6

work page doi:10.1038/s41746-024-01027-6 2024
[8]

Multilingual dysarthric speech assessment us- ing universal phone recognition and language-specific phonemic contrast modeling

Yeo, E. J., Liss, J. M., Berisha, V ., and Mortensen, D. R. (2026). Multilingual dysarthric speech assess- ment using universal phone recognition and language-specific phonemic contrast modeling. arXiv preprint arXiv:2601.21205.doi:10.48550/arXiv.2601.21205

work page doi:10.48550/arxiv.2601.21205 2026
[9]

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self- supervised learning of speech representations. In Advances in Neural Information Processing Systems, 33, 12449–12460

2020
[10]

Kadirvelu, B., Stumpf, L., Waibel, S., and Faisal, A. A. (2025). Speaker-independent dysarthria sever- ity classification using self-supervised transformers and multi-task learning. PLOS Digital Health, 4(11), e0001076.doi:10.1371/journal.pdig.0001076

work page doi:10.1371/journal.pdig.0001076 2025
[11]

Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y ., Pino, J., Baevski, A., Conneau, A., and Auli, M. (2022). XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Proceedings of Interspeech 2022, 2278–2282.doi:10.21437/Inter speech.2022-143

work page doi:10.21437/inter 2022
[12]

P., Huang, W.-C., and Toda, T

Violeta, L. P., Huang, W.-C., and Toda, T. (2022). Investigating self-supervised pretraining frameworks for pathological speech recognition. In Proceedings of Interspeech 2022.doi:10.21437/Interspeech.2 022-10043

work page doi:10.21437/interspeech.2 2022
[13]

Sapkota, B., Shrestha, S., and Baral, R. (2025). Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification. Speech Communication, 175, 103326. doi:10.1016/j.specom.2025.103326

work page doi:10.1016/j.specom.2025.103326 2025
[14]

Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Bae, J., Zheng, X., Kim, M., Yoo, C. D., and Hasegawa-Johnson, M. (2026). Something from nothing: Data augmentation for robust severity level estimation of dysarthric speech. arXiv:2603.15988

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

R., and Nöth, E

Javanmardi, F., Arias-Vergara, T., Orozco-Arroyave, J. R., and Nöth, E. (2024). Pre-trained models for detection and severity level classification of dysarthria from speech. Speech Communication, 156, 103047. doi:10.1016/j.specom.2024.103047

work page doi:10.1016/j.specom.2024.103047 2024
[16]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

Hsu, W.-N., Bolte, B., Tsai, Y .-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, 29, 3451–3460.doi:10.1109/TASLP.2021.3122291

work page doi:10.1109/taslp.2021.3122291 2021
[17]

Macmillan, N. A. and Creelman, C. D. (2005). Detection Theory: A User’s Guide (2nd ed.). Lawrence Erlbaum Associates

2005
[18]

P., Findlater, L., Lea, C., Herrlinger, S., Korn, P., Abou-Zahra, S., Heywood, R., Tomanek, K., and MacDonald, B

Hasegawa-Johnson, M., Zheng, X., Kim, H., Mendes, C., Dickinson, M., Hege, E., Zwilling, C., Moore Channell, M., Mattie, L., Hodges, H., Ramig, L., Bellard, M., Shebanek, M., Sari, L., Kalgaonkar, K., Frerichs, D., Bigham, J. P., Findlater, L., Lea, C., Herrlinger, S., Korn, P., Abou-Zahra, S., Heywood, R., Tomanek, K., and MacDonald, B. (2024). Community...

work page doi:10.1044/2024_jslhr-24-00122 2024
[19]

Panayotov, V ., Chen, G., Povey, D., and Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of ICASSP 2015, 5206–5210.doi:10.1109/ICASSP.2015. 7178964

work page doi:10.1109/icassp.2015 2015
[20]

K., and Wolff, T

Rudzicz, F., Namasivayam, A. K., and Wolff, T. (2012). The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46(4), 523–541.doi:10.100 7/s10579-011-9145-0

2012
[21]

Kim, H., Hasegawa-Johnson, M., Perlman, A., Gunderson, J., Huang, T., Watkin, K., and Frame, S. (2008). Dysarthric speech database for universal access research. In Proceedings of Interspeech 2008, 1741–1744. doi:10.21437/Interspeech.2008-480

work page doi:10.21437/interspeech.2008-480 2008
[22]

Rusko, M., Sabo, R., Trnka, M., Zimmermann, A., Malaschitz, R., Ruzicky, E., Brandoburova, P., Kevicka, V ., and Skorvanek, M. (2024). Slovak database of speech affected by neurodegenerative diseases. Scientific Data, 11, 1320.doi:10.1038/s41597-024-04171-6 22

work page doi:10.1038/s41597-024-04171-6 2024
[23]

Jesus, L. M. T., Belo, I., Machado, J., and Hall, A. (2017). The advanced voice function assessment databases (A VFAD): Tools for voice clinicians and speech research. In Advances in Speech-language Pathology. IntechOpen.doi:10.5772/intechopen.69643

work page doi:10.5772/intechopen.69643 2017
[24]

Middag, C., Martens, J.-P., Van Nuffelen, G., and De Bodt, M. (2009). Automated intelligibility assessment of pathological speech using phonological features. EURASIP Journal on Advances in Signal Processing, 2009, 1–9.doi:10.1155/2009/629030

work page doi:10.1155/2009/629030 2009
[25]

Ganzeboom, M., Bakker, M., Beijer, L., Strik, H., and Rietveld, T. (2022). A serious game for speech training in dysarthric speakers with Parkinson’s disease: Exploring therapeutic efficacy and patient sat- isfaction. International Journal of Language and Communication Disorders, 57(5), 1091–1106.doi: 10.1111/1460-6984.12722

work page doi:10.1111/1460-6984.12722 2022
[26]

Ganzeboom, M., Bakker, M., Beijer, L., Rietveld, T., and Strik, H. (2018). Speech training for neurological patients using a serious game. British Journal of Educational Technology, 49(4), 761–774.doi:10.1111/ bjet.12640

2018
[27]

A., Guerrero-Lopez, A., Luque-Buzo, E., Arias-Londono, J

Mendes-Laureano, J., Gomez-Garcia, J. A., Guerrero-Lopez, A., Luque-Buzo, E., Arias-Londono, J. D., Grandas-Perez, F. J., and Godino-Llorente, J. I. (2024). NeuroV oz: A Castilian Spanish corpus of parkin- sonian speech. Scientific Data, 11, 1367.doi:10.1038/s41597-024-04186-z

work page doi:10.1038/s41597-024-04186-z 2024
[28]

R., Arias-Londono, J

Orozco-Arroyave, J. R., Arias-Londono, J. D., Vargas-Bonilla, J. F., Gonzalez-Rativa, M. C., and Noth, E. (2014). New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease. In Proceedings of LREC 2014, 342–347

2014
[29]

Dimauro, G., Di Nicola, V ., Bevilacqua, V ., Caivano, D., and Girardi, F. (2017). Assessment of speech intelligibility in Parkinson’s disease using a speech-to-text system. IEEE Access, 5, 22199-22208.doi: 10.1109/ACCESS.2017.2762475

work page doi:10.1109/access.2017.2762475 2017
[30]

Turrisi, R., Braccia, A., Emanuele, M., Giulietti, S., Pugliatti, M., Sensi, M., Fadiga, L., and Badino, L. (2021). EasyCall corpus: A dysarthric speech dataset. In Proceedings of Interspeech 2021, 41–45. doi:10.21437/Interspeech.2021-549

work page doi:10.21437/interspeech.2021-549 2021
[31]

Gao, M., Chen, H., Du, J., Xu, X., Guo, H., Bu, H., Yang, J., Li, M., and Lee, C.-H. (2024). Enhancing voice wake-up for dysarthria: Mandarin Dysarthria Speech Corpus release and customized system design. In Proceedings of Interspeech 2024.doi:10.21437/Interspeech.2024-879

work page doi:10.21437/interspeech.2024-879 2024
[32]

Wan, Y ., Sun, M., Kang, X., Li, J., Guo, P., Gao, M., and Wang, S.-J. (2024). CDSD: Chinese dysarthria speech database. In Proceedings of Interspeech 2024, 4109–4113.doi:10.21437/Interspeech.202 4-1597

work page doi:10.21437/interspeech.202 2024
[33]

SLR65: Crowdsourced high-quality Tamil multi-speaker speech dataset [Dataset].http s://www.openslr.org/65/(accessed 2026-03-01)

OpenSLR (2020). SLR65: Crowdsourced high-quality Tamil multi-speaker speech dataset [Dataset].http s://www.openslr.org/65/(accessed 2026-03-01)

2020
[34]

and Barry, W

Puetzer, M. and Barry, W. J. (2007). Saarbruecken V oice Database. Institute of Phonetics, Saarland Uni- versity.http://www.stimmdatenbank.coli.uni-saarland.de/

2007
[35]

Mihajlik, P., Toth, L., and Nemeth, G. (2023). Hungarian dysarthric speech database [Dataset]. Budapest University of Technology and Economics

2023
[36]

Kenyan Swahili Dysarthric Speech Corpus [Dataset]

CDLI (2024). Kenyan Swahili Dysarthric Speech Corpus [Dataset]. Centre for Digital Language Inclusion, University of Cape Town.https://www.cdli.uct.ac.za/(accessed 2026-03-15)

2024
[37]

L., Palmer, K

Stipancic, K. L., Palmer, K. M., Rowe, H. P., Yunusova, Y ., Berry, J. D., and Green, J. R. (2021). You say severe, I say mild: Toward an empirical classification of dysarthria severity. Journal of Speech, Language, and Hearing Research, 64(12), 4718–4735.doi:10.1044/2021_JSLHR-21-00197

work page doi:10.1044/2021_jslhr-21-00197 2021
[38]

Grosman, J. (2021). Fine-tuned XLSR-53 large models for speech recognition [Model collection]. Hug- gingFace.https://huggingface.co/jonatasgrosman(accessed 2026-04-01)

2021
[39]

J., Hasegawa-Johnson, M., Jiang, P.-P., Kuila, A., Lea, C., MacDonald, B., Mantena, G., Ravichandran, V ., Sari, L., Tomanek, K., Yoo, C

Zheng, X., Phukon, B., Na, J., Cutrell, E., Han, K. J., Hasegawa-Johnson, M., Jiang, P.-P., Kuila, A., Lea, C., MacDonald, B., Mantena, G., Ravichandran, V ., Sari, L., Tomanek, K., Yoo, C. D., and Zwilling, C. (2025). The Interspeech 2025 Speech Accessibility Project Challenge. In Proceedings of Interspeech 2025, 3269–3273.doi:10.21437/Interspeech.2025-566

work page doi:10.21437/interspeech.2025-566 2025
[40]

V ., Senerchia, G., Salvatore, E., De Pietro, G., De Falco, I., and Sannino, G

Dubbioso, R., Spisto, M., Verde, L., Iuzzolino, V . V ., Senerchia, G., Salvatore, E., De Pietro, G., De Falco, I., and Sannino, G. (2024). V oice signals database of ALS patients with different dysarthria severity and healthy controls. Scientific Data, 11(1), 800.doi:10.1038/s41597-024-03597-2 23

work page doi:10.1038/s41597-024-03597-2 2024
[41]

Pratap, V ., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel- Zarandi, M., Baevski, A., Adi, Y ., Zhang, X., Hsu, W.-N., Conneau, A., and Auli, M. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97), 1–52

2024
[42]

Chen, S., Wang, C., Chen, Z., Wu, Y ., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y ., Qian, Y ., Wu, J., Zeng, M., Yu, X., and Wei, F. (2022). WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505–1518.doi:10.1...

work page doi:10.1109/jstsp.2022.3188113 2022
[43]

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates

1988
[44]

The SSNCE Database of Tamil Dysarthric Speech [Dataset]

LDC (2021). The SSNCE Database of Tamil Dysarthric Speech [Dataset]. Linguistic Data Consortium, LDC2021S04.doi:10.35111/hkh2-vh40

work page doi:10.35111/hkh2-vh40 2021
[45]

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proceedings of Interspeech 2017, 498–502. doi:10.21437/Interspeech.2017-1386

work page doi:10.21437/interspeech.2017-1386 2017
[46]

Van Nuffelen, G., Middag, C., De Bodt, M., and Martens, J.-P. (2009). Speech technology-based as- sessment of phoneme intelligibility in dysarthria. International Journal of Language and Communication Disorders, 44(5), 716–730.doi:10.1080/13682820802342062

work page doi:10.1080/13682820802342062 2009
[47]

and Yarkoni, T

Westfall, J. and Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLOS ONE, 11(3), e0152719.doi:10.1371/journal.pone.0152719 24

work page doi:10.1371/journal.pone.0152719 2016

[1] [1]

Duffy, J. R. (2019). Motor Speech Disorders: Substrates, Differential Diagnosis, and Management (4th ed.). Elsevier

2019

[2]

Muller, B., Ortiz Barranon, A. A., and Roberts, L. (2026). Training-free cross-lingual dysarthria severity assessment via phonological subspace analysis in self-supervised speech representations. arXiv preprint arXiv:2604.10123. doi:10.48550/arXiv.2604.10123

[3]

Choi, K., Yeo, E., Cho, C. J., Mortensen, D. R., and Harwath, D. (2026). Self-supervised speech models encode phonetic context via position-dependent orthogonal subspaces. arXiv preprint arXiv:2603.12642.

[4]

Cho, C. J., Wu, P., Mohamed, A., and Anumanchipalli, G. K. (2023). Evidence of vocal tract articulation in self-supervised learning of speech. In Proceedings of ICASSP 2023, 1–5. doi:10.1109/ICASSP49357.2023.10094711

[5]

Halpern, B. M., Tienkamp, T., Abur, D., and Toda, T. (2026). PathBench: Speech intelligibility benchmark for automatic pathological speech assessment. arXiv preprint arXiv:2603.08097. doi:10.48550/arXiv.2603.08097

[6]

Hernandez, A., Yeo, E., Choi, K., Li, C.-J., Yue, Z., Das, R. K., Rusz, J., Magimai Doss, M., Orozco-Arroyave, J. R., Arias-Vergara, T., Maier, A., Noth, E., Mortensen, D. R., Harwath, D., and Perez-Toro, P. A. (2026). Adapting self-supervised speech representations for cross-lingual dysarthria detection in Parkinson’s disease. arXiv preprint arXiv:2603.22225.

[7]

Rios-Urrego, C. D., Rusz, J., and Orozco-Arroyave, J. R. (2024). Automatic speech-based assessment to discriminate Parkinson’s disease from essential tremor with a cross-language approach. npj Digital Medicine, 7, 37. doi:10.1038/s41746-024-01027-6

[8]

Yeo, E. J., Liss, J. M., Berisha, V., and Mortensen, D. R. (2026). Multilingual dysarthric speech assessment using universal phone recognition and language-specific phonemic contrast modeling. arXiv preprint arXiv:2601.21205. doi:10.48550/arXiv.2601.21205

[9]

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, 33, 12449–12460.

[10]

Kadirvelu, B., Stumpf, L., Waibel, S., and Faisal, A. A. (2025). Speaker-independent dysarthria severity classification using self-supervised transformers and multi-task learning. PLOS Digital Health, 4(11), e0001076. doi:10.1371/journal.pdig.0001076

[11]

Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino, J., Baevski, A., Conneau, A., and Auli, M. (2022). XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Proceedings of Interspeech 2022, 2278–2282. doi:10.21437/Interspeech.2022-143

[12]

Violeta, L. P., Huang, W.-C., and Toda, T. (2022). Investigating self-supervised pretraining frameworks for pathological speech recognition. In Proceedings of Interspeech 2022. doi:10.21437/Interspeech.2022-10043

[13]

Sapkota, B., Shrestha, S., and Baral, R. (2025). Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification. Speech Communication, 175, 103326. doi:10.1016/j.specom.2025.103326

[14]

Bae, J., Zheng, X., Kim, M., Yoo, C. D., and Hasegawa-Johnson, M. (2026). Something from nothing: Data augmentation for robust severity level estimation of dysarthric speech. arXiv preprint arXiv:2603.15988.

[15]

Javanmardi, F., Arias-Vergara, T., Orozco-Arroyave, J. R., and Nöth, E. (2024). Pre-trained models for detection and severity level classification of dysarthria from speech. Speech Communication, 156, 103047. doi:10.1016/j.specom.2024.103047

[16]

Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460. doi:10.1109/TASLP.2021.3122291

[17]

Macmillan, N. A. and Creelman, C. D. (2005). Detection Theory: A User’s Guide (2nd ed.). Lawrence Erlbaum Associates.

[18]

Hasegawa-Johnson, M., Zheng, X., Kim, H., Mendes, C., Dickinson, M., Hege, E., Zwilling, C., Moore Channell, M., Mattie, L., Hodges, H., Ramig, L., Bellard, M., Shebanek, M., Sari, L., Kalgaonkar, K., Frerichs, D., Bigham, J. P., Findlater, L., Lea, C., Herrlinger, S., Korn, P., Abou-Zahra, S., Heywood, R., Tomanek, K., and MacDonald, B. (2024). Community... doi:10.1044/2024_jslhr-24-00122

[19]

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of ICASSP 2015, 5206–5210. doi:10.1109/ICASSP.2015.7178964

[20]

Rudzicz, F., Namasivayam, A. K., and Wolff, T. (2012). The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46(4), 523–541. doi:10.1007/s10579-011-9145-0

[21]

Kim, H., Hasegawa-Johnson, M., Perlman, A., Gunderson, J., Huang, T., Watkin, K., and Frame, S. (2008). Dysarthric speech database for universal access research. In Proceedings of Interspeech 2008, 1741–1744. doi:10.21437/Interspeech.2008-480

[22]

Rusko, M., Sabo, R., Trnka, M., Zimmermann, A., Malaschitz, R., Ruzicky, E., Brandoburova, P., Kevicka, V., and Skorvanek, M. (2024). Slovak database of speech affected by neurodegenerative diseases. Scientific Data, 11, 1320. doi:10.1038/s41597-024-04171-6

[23]

Jesus, L. M. T., Belo, I., Machado, J., and Hall, A. (2017). The advanced voice function assessment databases (AVFAD): Tools for voice clinicians and speech research. In Advances in Speech-language Pathology. IntechOpen. doi:10.5772/intechopen.69643

[24]

Middag, C., Martens, J.-P., Van Nuffelen, G., and De Bodt, M. (2009). Automated intelligibility assessment of pathological speech using phonological features. EURASIP Journal on Advances in Signal Processing, 2009, 1–9. doi:10.1155/2009/629030

[25]

Ganzeboom, M., Bakker, M., Beijer, L., Strik, H., and Rietveld, T. (2022). A serious game for speech training in dysarthric speakers with Parkinson’s disease: Exploring therapeutic efficacy and patient satisfaction. International Journal of Language and Communication Disorders, 57(5), 1091–1106. doi:10.1111/1460-6984.12722

[26]

Ganzeboom, M., Bakker, M., Beijer, L., Rietveld, T., and Strik, H. (2018). Speech training for neurological patients using a serious game. British Journal of Educational Technology, 49(4), 761–774. doi:10.1111/bjet.12640

[27]

Mendes-Laureano, J., Gomez-Garcia, J. A., Guerrero-Lopez, A., Luque-Buzo, E., Arias-Londono, J. D., Grandas-Perez, F. J., and Godino-Llorente, J. I. (2024). NeuroVoz: A Castilian Spanish corpus of parkinsonian speech. Scientific Data, 11, 1367. doi:10.1038/s41597-024-04186-z

[28]

Orozco-Arroyave, J. R., Arias-Londono, J. D., Vargas-Bonilla, J. F., Gonzalez-Rativa, M. C., and Noth, E. (2014). New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease. In Proceedings of LREC 2014, 342–347.

[29]

Dimauro, G., Di Nicola, V., Bevilacqua, V., Caivano, D., and Girardi, F. (2017). Assessment of speech intelligibility in Parkinson’s disease using a speech-to-text system. IEEE Access, 5, 22199–22208. doi:10.1109/ACCESS.2017.2762475

[30]

Turrisi, R., Braccia, A., Emanuele, M., Giulietti, S., Pugliatti, M., Sensi, M., Fadiga, L., and Badino, L. (2021). EasyCall corpus: A dysarthric speech dataset. In Proceedings of Interspeech 2021, 41–45. doi:10.21437/Interspeech.2021-549

[31]

Gao, M., Chen, H., Du, J., Xu, X., Guo, H., Bu, H., Yang, J., Li, M., and Lee, C.-H. (2024). Enhancing voice wake-up for dysarthria: Mandarin Dysarthria Speech Corpus release and customized system design. In Proceedings of Interspeech 2024. doi:10.21437/Interspeech.2024-879

[32]

Wan, Y., Sun, M., Kang, X., Li, J., Guo, P., Gao, M., and Wang, S.-J. (2024). CDSD: Chinese dysarthria speech database. In Proceedings of Interspeech 2024, 4109–4113. doi:10.21437/Interspeech.2024-1597

[33]

OpenSLR (2020). SLR65: Crowdsourced high-quality Tamil multi-speaker speech dataset [Dataset]. https://www.openslr.org/65/ (accessed 2026-03-01)

[34]

Puetzer, M. and Barry, W. J. (2007). Saarbruecken Voice Database. Institute of Phonetics, Saarland University. http://www.stimmdatenbank.coli.uni-saarland.de/

[35]

Mihajlik, P., Toth, L., and Nemeth, G. (2023). Hungarian dysarthric speech database [Dataset]. Budapest University of Technology and Economics.

[36]

CDLI (2024). Kenyan Swahili Dysarthric Speech Corpus [Dataset]. Centre for Digital Language Inclusion, University of Cape Town. https://www.cdli.uct.ac.za/ (accessed 2026-03-15)

[37]

Stipancic, K. L., Palmer, K. M., Rowe, H. P., Yunusova, Y., Berry, J. D., and Green, J. R. (2021). You say severe, I say mild: Toward an empirical classification of dysarthria severity. Journal of Speech, Language, and Hearing Research, 64(12), 4718–4735. doi:10.1044/2021_JSLHR-21-00197

[38]

Grosman, J. (2021). Fine-tuned XLSR-53 large models for speech recognition [Model collection]. HuggingFace. https://huggingface.co/jonatasgrosman (accessed 2026-04-01)

[39]

Zheng, X., Phukon, B., Na, J., Cutrell, E., Han, K. J., Hasegawa-Johnson, M., Jiang, P.-P., Kuila, A., Lea, C., MacDonald, B., Mantena, G., Ravichandran, V., Sari, L., Tomanek, K., Yoo, C. D., and Zwilling, C. (2025). The Interspeech 2025 Speech Accessibility Project Challenge. In Proceedings of Interspeech 2025, 3269–3273. doi:10.21437/Interspeech.2025-566

[40]

Dubbioso, R., Spisto, M., Verde, L., Iuzzolino, V. V., Senerchia, G., Salvatore, E., De Pietro, G., De Falco, I., and Sannino, G. (2024). Voice signals database of ALS patients with different dysarthria severity and healthy controls. Scientific Data, 11(1), 800. doi:10.1038/s41597-024-03597-2
