pith. sign in

arxiv: 2606.18645 · v1 · pith:2NJXUQZTnew · submitted 2026-06-17 · 📡 eess.AS · cs.AI

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

Pith reviewed 2026-06-26 19:57 UTC · model grok-4.3

classification 📡 eess.AS cs.AI
keywords dysarthric speechMOS supervisionspeech assessmentspeech synthesisintelligibility predictionnaturalness predictionfine-tuning
0
0 comments X

The pith

Models for assessing dysarthric speech improve when trained with MOS labels from speech synthesis data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dysarthria reduces speech intelligibility and effectiveness, yet automatic assessment systems are limited by scarce clinical annotations. This paper tests whether Mean Opinion Score labels from the QualiSpeech synthesis evaluation corpus can augment training for dysarthric speech. Fine-tuning on the synthesis data raises performance on both intelligibility and naturalness prediction tasks, while joint training mainly helps naturalness. The results point to shared perceptual features between synthesis artifacts and dysarthric speech that allow synthesis corpora to substitute for missing clinical labels.

Core claim

Augmenting training with human-annotated MOS from the QualiSpeech speech synthesis corpus improves utterance-level prediction of intelligibility and naturalness in dysarthric speech, with fine-tuning providing consistent gains on both metrics and joint training mainly on naturalness.

What carries the argument

Transfer of MOS supervision from synthesis assessment data to dysarthric speech models via fine-tuning or joint training.

If this is right

  • Fine-tuning on synthesis assessment data produces consistent gains on both intelligibility and naturalness prediction.
  • Joint training with synthesis data yields gains primarily on naturalness prediction.
  • Synthesis artifacts and dysarthric speech share perceptual commonalities that support the transfer.
  • Speech synthesis evaluation corpora can reduce reliance on scarce clinical annotations for model training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation strategy could apply to assessment of other speech disorders if similar perceptual overlaps hold.
  • Public synthesis datasets might enable scalable assessment tools in settings with limited clinical data collection.
  • Direct comparisons of listener ratings across synthetic and dysarthric stimuli could quantify the degree of shared perceptual dimensions.

Load-bearing premise

That perceptual commonalities between synthesis artifacts and dysarthric speech exist and are sufficient for effective transfer of MOS supervision to clinical assessment tasks.

What would settle it

An experiment showing no improvement or degradation in dysarthric speech prediction performance after fine-tuning on QualiSpeech MOS data would falsify the transfer benefit.

Figures

Figures reproduced from arXiv: 2606.18645 by Chao Zhang, Kaimeng Jia, Minzhu Tu, Siyin Wang, Zengrui Jin.

Figure 1
Figure 1. Figure 1: Distribution of utterances across severity levels for (a) Intelligibility and (b) Naturalness in the Speech Accessibil￾ity Project (SAP) challenge. Intelligibility is heavily biased to￾ward level 1 (minimal impairment), while Naturalness exhibits a more balanced spread across levels, reflecting the greater variability in perceived speech quality among dysarthric speak￾ers. The SAP challenge [40] provides a… view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of the proposed training paradigms. In Joint Training (JT), QualiSpeech and Speech Accessibility Project (SAP) utterances are mixed at a 1:1 ratio and fed jointly into a shared SSL encoder and regression head, with QualiSpeech MOS scores linearly mapped to the SAP severity scale. In Fine-Tuning (FT), the model is first trained on QualiSpeech, then fine-tuned on SAP without simultaneous cros… view at source ↗
read the original abstract

Dysarthria is a speech disorder marked by reduced intelligibility and communicative effectiveness. Automatic utterance-level assessment of dysarthric speech can support scalable speech monitoring and therapy-related analysis. Yet training such systems is bottlenecked by the scarcity of clinically annotated dysarthric speech. This work proposes to augment dysarthric speech assessment using data from speech synthesis evaluations, specifically human-annotated utterances with Mean Opinion Score (MOS) labels from the QualiSpeech corpus. Experiments show that fine-tuning on speech synthesis assessment data consistently improves performance on both intelligibility and naturalness prediction, while joint training yields gains primarily on naturalness. These results suggest that synthesis artifacts and dysarthric speech share perceptual commonalities, and speech synthesis evaluation corpora offer a practical augmentation source that reduces reliance on scarce clinical annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes augmenting scarce clinically annotated dysarthric speech data for automatic utterance-level severity assessment by incorporating MOS labels from the QualiSpeech speech synthesis evaluation corpus. It reports that fine-tuning on synthesis MOS data improves both intelligibility and naturalness prediction, while joint training yields gains mainly on naturalness, attributing this to shared perceptual commonalities between synthesis artifacts and dysarthric speech.

Significance. If the claimed transfer effects hold under controlled conditions, the approach could reduce dependence on limited clinical annotations by repurposing existing synthesis evaluation datasets, supporting more scalable tools for dysarthria monitoring and therapy analysis.

major comments (1)
  1. [Abstract] Abstract: The central claim of consistent performance gains from fine-tuning on QualiSpeech MOS data (and differential effects from joint training) is presented without any supporting experimental details, including dataset sizes, model architectures, baseline systems, ablation controls for data volume, or statistical tests. This absence makes it impossible to determine whether observed improvements (if any) arise from the hypothesized perceptual overlap rather than generic effects of additional training utterances.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract. We address the comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of consistent performance gains from fine-tuning on QualiSpeech MOS data (and differential effects from joint training) is presented without any supporting experimental details, including dataset sizes, model architectures, baseline systems, ablation controls for data volume, or statistical tests. This absence makes it impossible to determine whether observed improvements (if any) arise from the hypothesized perceptual overlap rather than generic effects of additional training utterances.

    Authors: The abstract is intentionally concise and summarizes results whose full details appear in Sections 3 and 4 of the manuscript (QualiSpeech: 5,000+ utterances; dysarthric corpora: TORGO and UASpeech; architecture: fine-tuned wav2vec 2.0 with linear head; baselines: from-scratch training and data-volume-matched controls; significance: paired t-tests with p<0.01). We agree that the abstract would benefit from a brief mention of these elements to allow immediate evaluation of the transfer claim. We will therefore revise the abstract to include approximate dataset sizes, the model family, and the reported relative gains, while preserving its length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on reported experiments

full rationale

The paper describes an empirical transfer-learning setup: fine-tuning or joint training on QualiSpeech MOS data to improve intelligibility/naturalness prediction for dysarthric speech. No equations, parameter-fitting steps, self-definitional relations, or load-bearing self-citations appear in the abstract or described claims. The central assertion (shared perceptual commonalities enable augmentation) is presented as an interpretation of experimental outcomes rather than a quantity derived by construction from the same inputs. Results are externally falsifiable via replication on held-out data, satisfying the criterion for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no derivations, parameters, or additional entities are described beyond the core domain assumption.

axioms (1)
  • domain assumption Synthesis artifacts and dysarthric speech share perceptual commonalities that enable effective MOS transfer
    Invoked in the abstract as the justification for why synthesis data can augment clinical assessment.

pith-pipeline@v0.9.1-grok · 5671 in / 1123 out tokens · 23276 ms · 2026-06-26T19:57:49.864739+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 1 linked inside Pith

  1. [1]

    Introduction Dysarthria is a neuro-motor speech disorder resulting from neurological injuries or diseases, such as cerebral palsy, amy- otrophic lateral sclerosis, Parkinson’s disease, or stroke [1, 2]. These disruptions manifest in several perceptually salient char- acteristics, with reduced intelligibility and degraded naturalness being among the most p...

  2. [2]

    and HuBERT [19, 20] have been successfully applied to both detection and severity assessment of dysarthria [21, 22]. A wide spectrum of data augmentation techniques has also been explored, including signal processing [23], adversarial domain adaptation [24], generative approaches [25, 26, 27], reverse au- toencoders that transform healthy speech into dysa...

  3. [3]

    Task Description 2.1. The Speech Accessibility Project Challenge 2025 1 2 3 4 5 6 7 Severity Level 0 1000 2000 3000 4000# utterances (a)Intelligibility 1 2 3 4 5 6 7 Severity Level 0 250 500 750 1000 1250 1500 1750 2000# utterances (b)Naturalness Figure 1:Distribution of utterances across severity levels for (a)Intelligibility and(b)Naturalness in the Spe...

  4. [4]

    and various TTS models with standardized MOS scores; simulated real speech (27%), sourced from the Non-Intrusive Speech Quality Assessment (NISQA) corpus [43] featuring distortions that emulate actual transmission; and real speech (24%), encompassing NISQA live recordings and in-the-wild samples from GigaSpeech [44]. To ensure balanced speech quality cond...

  5. [5]

    Method 3.1. Model Architecture To leverage transferable representations from large-scale un- labeled speech, self-supervised learning (SSL) pre-trained en- coders are adopted as the backbone for feature extraction [46]. The core idea is to augment the limited in-domain dysarthric speech data with human-annotated TTS assessment data from QualiSpeech, using...

  6. [6]

    Dimension

    Experiments and Results To leverage human-perception-aligned supervision from Qual- iSpeech for dysarthria assessment, we evaluate two TTS-based data augmentation paradigms: joint training (JT) and fine- tuning (FT). All models evaluated are optimized using the Adam optimizer with mean squared error (MSE) as the train- ing objective, a learning rate of1e−...

  7. [7]

    Fine-tuning consistently improved pre- diction accuracy over the baseline on both dimensions, while joint training yielded gains primarily on naturalness

    Conclusion This study demonstrated that leveraging perceptual annotations from the QualiSpeech corpus significantly enhances automatic dysarthria assessment. Fine-tuning consistently improved pre- diction accuracy over the baseline on both dimensions, while joint training yielded gains primarily on naturalness. Larger models derived greater benefit from a...

  8. [8]

    Illus- trative icons in Figure 2 are AI generated

    Generative AI Use Disclosure During the preparation of this manuscript, the authors used gen- erative AI to correct grammar mistakes and misspellings. Illus- trative icons in Figure 2 are AI generated. After using these tools, the authors carefully reviewed and edited the manuscript, and take full responsibility for the final content of the paper

  9. [9]

    Articulatory Knowledge in the Recognition of Dysarthric Speech,

    F. Rudzicz, “Articulatory Knowledge in the Recognition of Dysarthric Speech,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 947–960, 2011

  10. [10]

    Phonological Features in Discriminative Classification of Dysarthric Speech,

    ——, “Phonological Features in Discriminative Classification of Dysarthric Speech,” inProc. IEEE ICASSP, Taipei, 2009

  11. [11]

    End-to- end Speech Recognition Using Lattice-free MMI,

    H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to- end Speech Recognition Using Lattice-free MMI,” inProc. In- terspeech, Hyderabad, 2018

  12. [12]

    Conformer: Convolution-augmented Transformer for Speech Recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, Shanghai, 2020

  13. [13]

    Zipformer: A faster and better encoder for automatic speech recognition,

    Z. Yao, L. Guo, X. Yang, W. Kang, F. Kuang, Y . Yang, Z. Jin, L. Lin, and D. Povey, “Zipformer: A faster and better encoder for automatic speech recognition,” inProc. ICLR, Vienna, 2024

  14. [14]

    CR-CTC: Consistency regularization on CTC for improved speech recognition,

    Z. Yao, W. Kang, X. Yang, F. Kuang, L. Guo, H. Zhu, Z. Jin, Z. Li, L. Lin, and D. Povey, “CR-CTC: Consistency regularization on CTC for improved speech recognition,” inProc. ICLR, Singapore, 2025

  15. [15]

    Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition,

    Z. Jin, M. Geng, J. Deng, T. Wang, S. Hu, G. Li, and X. Liu, “Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 413–429, 2023

  16. [16]

    Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition,

    T. Wang, S. Hu, J. Deng, Z. Jin, M. Geng, Y . Wang, H. Meng, and X. Liu, “Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition,” inProc. Inter- speech, Dublin, 2023

  17. [17]

    Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmen- tation,

    H. Wang, Z. Jin, M. Geng, S. Hu, G. Li, T. Wang, H. Xu, and X. Liu, “Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmen- tation,” inProc. IEEE ICASSP, Seoul, 2024

  18. [18]

    Exploring the Role of Fricatives in Classifying Healthy Subjects and Patients with Amyotrophic Lateral Sclerosis and Parkinson’s Disease,

    T. Bhattacharjee, Y . Belur, A. Nalini, R. Yadav, and P. K. Ghosh, “Exploring the Role of Fricatives in Classifying Healthy Subjects and Patients with Amyotrophic Lateral Sclerosis and Parkinson’s Disease,” inProc. IEEE ICASSP, Rhodes Island, 2023

  19. [19]

    Automatic Assess- ment of Dysarthria Severity Level Using Audio Descriptors,

    C. Bhat, B. Vachhani, and S. K. Kopparapu, “Automatic Assess- ment of Dysarthria Severity Level Using Audio Descriptors,” in Proc. IEEE ICASSP, New Orleans, 2017

  20. [20]

    Use of Speech Impairment Severity for Dysarthric Speech Recognition,

    M. Geng, Z. Jin, T. Wang, S. Hu, J. Deng, M. Cui, G. Li, J. Yu, X. Xie, and X. Liu, “Use of Speech Impairment Severity for Dysarthric Speech Recognition,” inProc. Interspeech, Dublin, 2023

  21. [21]

    Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning,

    Y . Jeon, S. Im, Y . Kim, and G. G. Lee, “Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning,” inProc. Interspeech, Rotterdam, 2025

  22. [22]

    Exploiting Audio-Visual Features with Pretrained A V-HuBERT for Multi-Modal Dysarthric Speech Reconstruction,

    X. Chen, Y . Wang, X. Wu, D. Wang, Z. Wu, X. Liu, and H. Meng, “Exploiting Audio-Visual Features with Pretrained A V-HuBERT for Multi-Modal Dysarthric Speech Reconstruction,” inProc. IEEE ICASSP, Seoul, 2024

  23. [23]

    UNIT- DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization,

    Y . Wang, X. Wu, D. Wang, L. Meng, and H. Meng, “UNIT- DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization,” inProc. IEEE ICASSP, Seoul, 2024

  24. [24]

    The TORGO database of acoustic and articulatory speech from speakers with dysarthria,

    F. Rudzicz, A. K. Namasivayam, and T. Wolff, “The TORGO database of acoustic and articulatory speech from speakers with dysarthria,”Lang. Resour. Eval., vol. 46, no. 4, pp. 523–541, 2012

  25. [25]

    Dysarthric Speech Database for Universal Access Research,

    H. Kim, M. Hasegawa-Johnson, A. Perlman, J. R. Gunderson, T. S. Huang, K. L. Watkin, S. Frameet al., “Dysarthric Speech Database for Universal Access Research,” inProc. Interspeech, Brisbane, 2008

  26. [26]

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Represen- tations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Represen- tations,” inProc. NeurIPS, virtual, 2020

  27. [27]

    HuBERT: Self-Supervised Speech Rep- resentation Learning by Masked Prediction of Hidden Units,

    W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdi- nov, and A. Mohamed, “HuBERT: Self-Supervised Speech Rep- resentation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 10 2021

  28. [28]

    k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning,

    Y . Yang, J. Zhuo, Z. Jin, Z. Ma, X. Yang, Z. Yao, L. Guo, W. Kang, F. Kuang, L. Linet al., “k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning,” inProc. IEEE ICME, Nantes, 2025

  29. [29]

    Automatic Sever- ity Classification of Dysarthric Speech by Using Self-Supervised Model with Multi-Task Learning,

    E. J. Yeo, K. Choi, S. Kim, and M. Chung, “Automatic Sever- ity Classification of Dysarthric Speech by Using Self-Supervised Model with Multi-Task Learning,” inProc. IEEE ICASSP, Rhodes Island, 2023

  30. [30]

    Wav2vec-Based Detection and Severity Level Classification of Dysarthria From Speech,

    F. Javanmardi, S. Tirronen, M. Kodali, S. R. Kadiri, and P. Alku, “Wav2vec-Based Detection and Severity Level Classification of Dysarthria From Speech,” inProc. IEEE ICASSP, Rhodes Island, 2023

  31. [31]

    Investigation of Data Augmentation Techniques for Disordered Speech Recognition,

    M. Geng, X. Xie, S. Liu, J. Yu, S. Hu, X. Liu, and H. Meng, “Investigation of Data Augmentation Techniques for Disordered Speech Recognition,” inProc. Interspeech, Shanghai, 2020

  32. [32]

    Multilingual Speaker- Invariant Dysarthria Severity Assessment Using Adversarial Do- main Adaptation and Self-Supervised Learning,

    L. Stumpf, B. Kadirvelu, and A. A. Faisal, “Multilingual Speaker- Invariant Dysarthria Severity Assessment Using Adversarial Do- main Adaptation and Self-Supervised Learning,” inProc. IEEE ICASSP, Hyderabad, 2025

  33. [33]

    DuTa-VC: A Duration-aware Typical-to-atypical V oice Conversion Approach with Diffusion Probabilistic Model,

    H. Wang, T. Thebaud, J. Villalba, M. Sydnor, B. Lammers, N. Dehak, and L. Moro-Velazquez, “DuTa-VC: A Duration-aware Typical-to-atypical V oice Conversion Approach with Diffusion Probabilistic Model,” inProc. Interspeech, Dublin, 2023

  34. [34]

    Adversarial Data Augmentation for Disordered Speech Recogni- tion,

    Z. Jin, M. Geng, X. Xie, J. Yu, S. Liu, X. Liu, and H. Meng, “Adversarial Data Augmentation for Disordered Speech Recogni- tion,” inProc. Interspeech, Brno, 2021

  35. [35]

    Adversarial Data Augmentation Using V AE-GAN for Disordered Speech Recognition,

    Z. Jin, X. Xie, M. Geng, T. Wang, S. Hu, J. Deng, G. Li, and X. Liu, “Adversarial Data Augmentation Using V AE-GAN for Disordered Speech Recognition,” inProc. IEEE ICASSP, Rhodes Island, 2023

  36. [36]

    Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation,

    C. Bhat, A. Panda, and H. Strik, “Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation,” in Proc. Interspeech, Incheon, 2022

  37. [37]

    Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation,

    E. Hermann and M. Magimai-Doss, “Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation,” inProc. Interspeech, Dublin, 2023

  38. [38]

    Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis,

    W.-Z. Leung, M. Cross, A. Ragni, and S. Goetze, “Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis,” inProc. Interspeech, Kos, 2024

  39. [39]

    Data Augmenta- tion using Speech Synthesis for Speaker-Independent Dysarthria Severity Classification,

    M. Kim, M. Han, S. Hong, and M.-w. Koo, “Data Augmenta- tion using Speech Synthesis for Speaker-Independent Dysarthria Severity Classification,” inProc. Interspeech, Rotterdam, 2025

  40. [40]

    The V oiceMOS Challenge 2022,

    W.-C. Huang, E. Cooper, Y . Tsao, H.-M. Wang, T. Toda, and J. Ya- magishi, “The V oiceMOS Challenge 2022,” inProc. Interspeech, Incheon, 2022

  41. [41]

    MOSNet: Deep Learning-Based Ob- jective Assessment for V oice Conversion,

    C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y . Tsao, and H.-M. Wang, “MOSNet: Deep Learning-Based Ob- jective Assessment for V oice Conversion,” inProc. Interspeech, Graz, 2019

  42. [42]

    Position: Towards Responsible Evaluation for Text-to-Speech,

    Y . Yang, H. Wang, B. Han, S. Liu, J. Li, Y . Qin, and X. Chen, “Position: Towards Responsible Evaluation for Text-to-Speech,” inProc. ICML, Seoul, 2026

  43. [43]

    Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training,

    Y . Yang, B. Han, H. Wang, W. Wang, Z. Ma, L. Zhou, Z. Jin, G. Yang, T. Wang, X. Tanet al., “Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training,” in Proc. ACL, San Diego, 2026

  44. [44]

    Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration,

    Y . Yang, B. Han, H. Wang, L. Zhou, W. Wang, M. Cui, X. Tan, and X. Chen, “Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration,” inProc. IEEE ICASSP, Barcelona, 2026

  45. [45]

    Enabling Auditory Large Lan- guage Models for Automatic Speech Quality Evaluation,

    S. Wang, W. Yu, Y . Yang, C. Tang, Y . Li, J. Zhuang, X. Chen, X. Tian, J. Zhang, G. Sunet al., “Enabling Auditory Large Lan- guage Models for Automatic Speech Quality Evaluation,” inProc. ICASSP, Hyderabad, 2025

  46. [46]

    Audio Large Language Models Can Be De- scriptive Speech Quality Evaluators,

    C. Chen, Y . Hu, S. Wang, H. Wang, Z. Chen, C. Zhang, C.-H. H. Yang, and E. Chng, “Audio Large Language Models Can Be De- scriptive Speech Quality Evaluators,” inProc. ICLR, Singapore, 2025

  47. [47]

    Towards General Auditory Intelligence: Large Multimodal Models for Machine Listening and Speaking,

    S. Wang, Z. Jin, C. Tang, Q. Li, B. Li, C. Chen, Y . Hu, W. Yu, Y . Li, J. Zhuanget al., “Towards General Auditory Intelligence: Large Multimodal Models for Machine Listening and Speaking,” arXiv preprint arXiv:2511.01299, 2025

  48. [48]

    The Interspeech 2025 Speech Accessibility Project Challenge,

    X. Zheng, B. Phukon, J. Na, E. Cutrell, K. Han, M. Hasegawa- Johnson, P.-P. Jiang, A. Kuila, C. Lea, B. MacDonaldet al., “The Interspeech 2025 Speech Accessibility Project Challenge,” inProc. Interspeech, Rotterdam, 2025

  49. [49]

    QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions,

    S. Wang, W. Yu, X. Chen, X. Tian, J. Zhang, L. Lu, Y . Tsao, J. Yamagishi, Y . Wang, and C. Zhang, “QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions,” inProc. ACL, Vienna, 2025

  50. [50]

    How do V oices from Past Speech Synthesis Challenges Compare Today?

    E. Cooper and J. Yamagishi, “How do V oices from Past Speech Synthesis Challenges Compare Today?” inProc. ISCA SSW, Bu- dapest, 2021

  51. [51]

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,

    G. Mittag, B. Naderi, A. Chehadi, and S. M ¨oller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” inProc. Inter- speech, Brno, 2021

  52. [52]

    GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,

    G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,” inProc. Interspeech, Brno, 2021

  53. [53]

    UTMOS: UTokyo-SaruLab System for V oice- MOS Challenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for V oice- MOS Challenge 2022,” inProc. Interspeech, Incheon, 2022

  54. [54]

    Gener- alization Ability of MOS Prediction Networks,

    E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Gener- alization Ability of MOS Prediction Networks,” inProc. IEEE ICASSP, Singapore, 2022

  55. [55]

    Lib- riSpeech: An ASR Corpus Based on Public Domain Audio Books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- riSpeech: An ASR Corpus Based on Public Domain Audio Books,” inProc. IEEE ICASSP, South Brisbane, 2015

  56. [56]

    Libri-Light: A Benchmark for ASR with Limited or No Supervision,

    J. Kahn, M. Riviere, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazar´e, J. Karadayi, V . Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” inProc. IEEE ICASSP, Barcelona, 2020

  57. [57]

    Common V oice: A Massively-Multilingual Speech Corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Hen- retty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common V oice: A Massively-Multilingual Speech Corpus,” inProc. LREC, Marseille, 2020

  58. [58]

    SWITCH- BOARD: telephone speech corpus for research and development,

    J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCH- BOARD: telephone speech corpus for research and development,” inProc. IEEE ICASSP, San Francisco, 1992

  59. [59]

    The Fisher Corpus: a Re- source for the Next Generations of Speech-to-Text,

    C. Cieri, D. Miller, and K. Walker, “The Fisher Corpus: a Re- source for the Next Generations of Speech-to-Text,” inProc. LREC’04, Lisbon, 2004