Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers
Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3
The pith
Dysarthria from different causes produces distinct phonological subspace collapse patterns that keep the same shape across languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14), and Parkinson's disease is separable from the articulatory execution group at Cohen's d = 0.83. Cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across languages for each aetiology. All six SSL backbones produce monotonic severity gradients with inter-model agreement above rho = 0.77, and fixed-token estimation preserves the severity correlation.
What carries the argument
d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, which measures how much each aetiology reduces the model's ability to distinguish phonological classes such as consonants and vowels.
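To make the quantity concrete, here is a minimal sketch of a d-prime computation between two phonological classes. It assumes the standard signal-detection form (difference of class means over the pooled standard deviation) applied along the mean-difference axis of the embeddings. The paper's exact estimator may differ, and the names `d_prime` and `toy_tokens`, as well as the toy vectors, are illustrative only.

```python
import math
import random
from statistics import mean, variance


def d_prime(class_a, class_b):
    """d' between two sets of embedding vectors along their mean-difference axis.

    Assumption: standard signal-detection d' with pooled SD; not necessarily
    the estimator used in the paper.
    """
    dim = len(class_a[0])
    mu_a = [mean(v[i] for v in class_a) for i in range(dim)]
    mu_b = [mean(v[i] for v in class_b) for i in range(dim)]
    axis = [a - b for a, b in zip(mu_a, mu_b)]
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        return 0.0  # identical class means: no separation
    axis = [x / norm for x in axis]
    # Project every token onto the unit axis joining the two class means.
    proj_a = [sum(x * w for x, w in zip(v, axis)) for v in class_a]
    proj_b = [sum(x * w for x, w in zip(v, axis)) for v in class_b]
    pooled_sd = math.sqrt(0.5 * (variance(proj_a) + variance(proj_b)))
    if pooled_sd == 0.0:
        return math.inf  # no within-class spread at all
    return (mean(proj_a) - mean(proj_b)) / pooled_sd


def toy_tokens(rng, centre, spread, n=200, dim=4):
    """Hypothetical stand-in for frame embeddings of one phonological class."""
    return [[rng.gauss(centre, spread) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
# Same class centres, but the "collapsed" speaker has much noisier tokens.
clear = d_prime(toy_tokens(rng, 1.0, 0.5), toy_tokens(rng, -1.0, 0.5))
collapsed = d_prime(toy_tokens(rng, 1.0, 2.0), toy_tokens(rng, -1.0, 2.0))
print(f"clear d' = {clear:.2f}, collapsed d' = {collapsed:.2f}")
```

In this toy setting, a lower d-prime for the noisier speaker is what "subspace collapse" would look like: the class centres stay put while the tokens around them blur together.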
If this is right
- Supports a training-free, architecture-independent framework for aetiology-aware dysarthria characterisation.
- Enables language-independent phenotyping of degradation patterns, although absolute severity still requires within-corpus calibration.
- Group-level distinction works for most phonological features while individual-level classification stays limited at 22.6 percent macro F1.
- Cross-backbone agreement above rho = 0.77 and preserved correlations at fixed token counts confirm the signal is robust and not a token-count artefact.
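The cross-backbone agreement figure can be read as the smallest pairwise Spearman correlation between per-speaker scores from different models. The sketch below shows one way to compute that, under the assumption that every backbone scores the same speakers in the same order. The backbone names and scores are hypothetical, and whether the paper reports the minimum or another summary of the pairwise correlations is not stated in the text above.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    positions = {}
    for idx, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(idx)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def min_pairwise_rho(scores_by_model):
    """Smallest Spearman rho over all pairs of models."""
    return min(
        spearman(scores_by_model[a], scores_by_model[b])
        for a, b in combinations(scores_by_model, 2)
    )


# Toy per-speaker d' scores from three hypothetical backbones.
scores = {
    "backbone_a": [2.9, 2.4, 1.8, 1.1, 0.7],
    "backbone_b": [3.5, 3.1, 2.0, 1.6, 0.9],
    "backbone_c": [2.2, 2.5, 1.5, 1.0, 0.8],
}
print(f"minimum pairwise rho = {min_pairwise_rho(scores):.2f}")
```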
Where Pith is reading between the lines
- Automated tools could screen for likely aetiology from speech samples alone in multilingual clinical settings.
- The high cross-lingual shape stability suggests phonological subspaces break down in ways that are fundamental to speech motor control rather than language-specific.
- Longitudinal tracking of the same speakers could reveal how subspace collapse progresses with disease stage.
- Calibration methods that align absolute d-prime values across datasets would allow severity comparisons between studies and languages.
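The distinction between profile shape and absolute magnitude can be illustrated with a few lines of code. Cosine similarity ignores overall scale, so two profiles with the same shape but different magnitudes score 1.0, which is why shape stability does not by itself give cross-corpus severity calibration. The profile values below are invented for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 5-dimensional consonant d' profiles (illustrative numbers only).
language_a = [2.0, 1.5, 1.0, 0.8, 0.5]
language_b = [0.5 * x for x in language_a]  # same shape, half the magnitude
reordered = [0.5, 0.8, 1.0, 1.5, 2.0]       # same values, different shape

print(f"same shape, scaled: {cosine(language_a, language_b):.3f}")
print(f"different shape:    {cosine(language_a, reordered):.3f}")
```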
Load-bearing premise
The measured d-prime differences genuinely reflect aetiology-driven phonological subspace collapse rather than recording conditions, speaker demographics, or dataset-specific artefacts.
What would settle it
A re-analysis of d-prime profiles after matching speakers across aetiologies for age, sex, recording quality, and language. If the group differences disappeared under matching, the claim that profiles are aetiology-specific would be falsified.
Original abstract
We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
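The fixed-token control mentioned at the end of the abstract can be sketched as repeated subsampling: draw the same number of tokens from each class, compute d-prime on the subsample, and average over draws. The 200 tokens per class comes from the abstract. The number of repeats, the seed, the skip rule for short classes, and the simplification to one-dimensional projected values are all assumptions made for this sketch.

```python
import math
import random
from statistics import mean, variance


def d_prime_1d(a, b):
    """Absolute d' between two lists of scalar values (pooled-SD form)."""
    pooled_sd = math.sqrt(0.5 * (variance(a) + variance(b)))
    return abs(mean(a) - mean(b)) / pooled_sd if pooled_sd > 0 else 0.0


def fixed_token_d_prime(a, b, n_tokens=200, n_repeats=50, seed=0):
    """Average d' over random subsamples of exactly n_tokens per class.

    Returns None when either class has fewer than n_tokens tokens
    (hypothetical skip rule, not taken from the paper).
    """
    if min(len(a), len(b)) < n_tokens:
        return None
    rng = random.Random(seed)
    draws = [
        d_prime_1d(rng.sample(a, n_tokens), rng.sample(b, n_tokens))
        for _ in range(n_repeats)
    ]
    return mean(draws)


# Toy speaker with very unequal token counts per class.
data_rng = random.Random(1)
class_a = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
class_b = [data_rng.gauss(1.5, 1.0) for _ in range(300)]
print(f"fixed-token d' = {fixed_token_d_prime(class_a, class_b):.2f}")
```

Holding the token count constant means that any remaining correlation with severity cannot be explained by some speakers simply contributing more tokens than others.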
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper scales a training-free d-prime separability analysis of phonological feature subspaces in frozen SSL speech representations to 3,374 speakers across 25 datasets, 12 languages, and 5 dysarthria aetiologies (PD, CP, ALS, DS, stroke) plus controls. It reports three main results: (1) aetiology-specific degradation profiles are group-distinguishable (10/13 features with ε² > 0.14, Holm p < 0.001; PD vs. articulatory-execution group d = 0.83), though individual classification is limited (22.6% macro F1); (2) consonant d-prime profile shapes are cross-lingually stable (cosine > 0.95) while absolute magnitudes are not; (3) the pattern is robust across 6 SSL backbones (inter-model ρ > 0.77) and preserved under fixed-token estimation.
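For readers unfamiliar with the ε² effect size quoted above, the following sketch computes one common variant: the rank-based epsilon-squared derived from the Kruskal-Wallis H statistic, ε² = H / (n - 1). The text above does not say which estimator the paper uses, so this choice is an assumption; the H statistic here also omits the tie correction for brevity, and the group values are toy numbers.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    positions = {}
    for idx, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(idx)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction, for brevity)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def epsilon_squared(groups):
    """Rank-based epsilon-squared: H divided by (n - 1)."""
    n = sum(len(g) for g in groups)
    return kruskal_wallis_h(groups) / (n - 1)


# Toy d' values for three hypothetical aetiology groups.
well_separated = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
interleaved = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(f"well separated: {epsilon_squared(well_separated):.2f}")
print(f"interleaved:    {epsilon_squared(interleaved):.2f}")
```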
Significance. If the attribution to aetiology holds after confound controls, the work supplies a scalable, training-free, architecture-independent phenotyping tool for dysarthria that distinguishes degradation patterns by cause and language, with direct clinical relevance for severity assessment and subgrouping. The large multi-dataset, multi-language sample and explicit reporting of effect sizes, p-values, and inter-model agreement are strengths.
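The Holm-corrected p-values quoted in the summary come from a step-down adjustment across the tested features. A minimal sketch of the standard Holm procedure follows; the function name `holm_adjust` and the input p-values are placeholders, not results from the paper.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy raw p-values for three hypothetical features.
raw = [0.01, 0.04, 0.03]
print([round(p, 3) for p in holm_adjust(raw)])
```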
Major comments (2)
- [Methods/Results (pooled analysis)] Methods and Results sections: the pooled analysis across 25 heterogeneous datasets attributes between-aetiology d-prime differences directly to disease without reported dataset-level matching, regression on recording-quality covariates (microphone, sampling rate, noise), or within-dataset replication of the aetiology contrast. Because SSL embeddings are sensitive to corpus-level statistics, the reported ε² > 0.14 and d = 0.83 could partly reflect stable dataset artifacts rather than phonological subspace collapse; the cross-lingual cosine > 0.95 is equally consistent with a stable confound pattern.
- [Results (classification performance)] Results, first finding: individual-level classification performance is reported as only 22.6% macro F1 despite large group-level effect sizes; this gap is acknowledged but not quantified with respect to how much of the group separability survives after speaker-level demographic or recording controls, weakening the claim that the profiles are aetiology-specific at a clinically usable level.
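The gap between a large group-level effect and weak individual-level classification is expected on statistical grounds alone. As a rough illustration, for two balanced groups with equal-variance Gaussian scores separated by Cohen's d, the best possible threshold classifier reaches an accuracy of Φ(d/2). The sketch below evaluates that bound; it is an idealised two-class calculation under those stated assumptions and is not directly comparable to the paper's multi-class 22.6% macro F1.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def best_two_class_accuracy(d):
    """Accuracy of the optimal threshold for two balanced, equal-variance
    Gaussian groups whose means differ by Cohen's d."""
    return normal_cdf(abs(d) / 2.0)


for d in (0.2, 0.5, 0.83, 2.0):
    print(f"d = {d:.2f} -> best accuracy = {best_two_class_accuracy(d):.3f}")
```

Under these assumptions, d = 0.83 corresponds to roughly two correct calls out of three on a single one-dimensional score, so substantial overlap between individuals is compatible with a clearly detectable group difference.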
Minor comments (2)
- [Abstract] Abstract: the claim of 'architecture-independent' results should be qualified by the specific 6 backbones tested rather than left unqualified.
- [Discussion] The paper notes that absolute d-prime magnitudes require within-corpus calibration; this limitation should be stated more prominently when discussing clinical translation.
Simulated Author's Rebuttal
We thank the referee for highlighting important methodological concerns regarding potential confounds in our pooled analysis and the interpretation of our classification results. We provide point-by-point responses below and indicate where revisions will be made to address these issues.
- Referee: Methods and Results sections: the pooled analysis across 25 heterogeneous datasets attributes between-aetiology d-prime differences directly to disease without reported dataset-level matching, regression on recording-quality covariates (microphone, sampling rate, noise), or within-dataset replication of the aetiology contrast. Because SSL embeddings are sensitive to corpus-level statistics, the reported ε² > 0.14 and d = 0.83 could partly reflect stable dataset artifacts rather than phonological subspace collapse; the cross-lingual cosine > 0.95 is equally consistent with a stable confound pattern.
  Authors: We agree that the absence of explicit controls for dataset-level factors such as recording quality is a limitation of the current pooled analysis. Although we have replication across multiple datasets for several aetiologies and observe high cross-lingual profile stability, this does not fully rule out stable confounds. In the revised manuscript, we will include additional analyses: (1) regression of d-prime values on available recording metadata where reported across datasets, and (2) within-dataset effect size calculations for aetiologies represented in multiple corpora. These will be reported in a new supplementary section to better isolate aetiology-specific effects.
  Revision: partial
- Referee: Results, first finding: individual-level classification performance is reported as only 22.6% macro F1 despite large group-level effect sizes; this gap is acknowledged but not quantified with respect to how much of the group separability survives after speaker-level demographic or recording controls, weakening the claim that the profiles are aetiology-specific at a clinically usable level.
  Authors: We acknowledge that the modest individual classification performance (22.6% macro F1) limits clinical applicability at the single-speaker level, and we have not yet quantified the robustness of group separability after speaker-level controls. In revision, we will add speaker-level analyses using mixed-effects models to evaluate the unique variance explained by aetiology after accounting for demographics (age, sex) and dataset indicators. This will provide a clearer assessment of the aetiology-specific signal at both group and individual levels.
  Revision: yes
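The within-dataset effect sizes promised in the first response could be computed along the following lines: group speakers by corpus, and compute Cohen's d for an aetiology contrast only inside corpora that contain both groups, so that corpus-level recording differences cannot drive the effect. The row layout `(dataset, aetiology, value)`, the function names, the `min_n` threshold, and the toy numbers are all assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d with the pooled (n - 1 weighted) standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    diff = mean(x) - mean(y)
    if pooled_var == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def within_dataset_d(rows, group_a, group_b, min_n=2):
    """Cohen's d for group_a vs group_b, computed separately per dataset.

    rows: iterable of (dataset, aetiology, d_prime_value) tuples.
    Datasets lacking min_n speakers in either group are skipped.
    """
    by_dataset = defaultdict(lambda: defaultdict(list))
    for dataset, aetiology, value in rows:
        by_dataset[dataset][aetiology].append(value)
    result = {}
    for dataset, groups in by_dataset.items():
        a, b = groups.get(group_a, []), groups.get(group_b, [])
        if len(a) >= min_n and len(b) >= min_n:
            result[dataset] = cohens_d(a, b)
    return result


# Toy rows: corpus_1 holds both groups, corpus_2 holds only one.
rows = [
    ("corpus_1", "control", 3.0), ("corpus_1", "control", 3.2), ("corpus_1", "control", 2.8),
    ("corpus_1", "PD", 2.0), ("corpus_1", "PD", 2.2), ("corpus_1", "PD", 1.8),
    ("corpus_2", "PD", 1.0), ("corpus_2", "PD", 1.4),
]
print(within_dataset_d(rows, "control", "PD"))
```

Corpora that contain a single aetiology drop out of this analysis entirely, which makes visible how much of the pooled contrast rests on between-corpus comparisons.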
Circularity Check
No circularity: results are direct empirical measurements on new data
The paper applies a previously introduced d-prime separability method to a new collection of 3,374 speakers across 25 datasets. All reported quantities (epsilon-squared effect sizes, Cohen's d, cosine similarities of profile shapes, Spearman correlations) are computed directly from the frozen SSL embeddings and group labels in the current data. No equations redefine the target profiles in terms of themselves, no fitted parameters are relabeled as predictions, and the self-citation to the prior method paper is not load-bearing for the distinguishability or stability claims. The derivation chain consists of standard statistical comparisons on independent samples and therefore contains no reduction by construction.