
arxiv: 2411.03715 · v2 · submitted 2024-11-06 · 💻 cs.SD · eess.AS

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Pith reviewed 2026-05-23 17:55 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords subjective speech quality assessment · out-of-domain generalization · MOS prediction · data pooling · speech quality benchmark · multi-dataset training · perceptual quality modeling

The pith

Pooling multiple speech quality datasets yields better out-of-domain generalization than single-set or domain-aware training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current subjective speech quality assessment models often fail when applied to new recording conditions or listeners. The paper introduces MOS-Bench, a collection of 8 training sets and 17 test sets that span different languages, systems, and listening tests. Experiments show that simply combining the training sets improves prediction accuracy on the held-out test sets, while an existing domain-aware method does not add clear benefit. The authors further find that the diversity of the combined data matters more than simply increasing its total volume. This points to a practical route for building more reliable quality predictors without new model architectures.

Core claim

Existing SSQA models exhibit large performance drops on out-of-domain test sets. Training on the pooled collection of eight datasets produces higher correlation with human scores on the seventeen test sets than training on any single dataset or using AlignNet. Variation across the pooled data contributes to this gain beyond the effect of dataset size alone.
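
The "correlation with human scores" in this claim can be made concrete with a small sketch. This page does not state which coefficients the paper reports, so the choice of linear (Pearson) and rank (Spearman) correlation below is an assumption, as are the function names and the toy numbers.

```python
import numpy as np

def pearson(x, y):
    """Linear correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy numbers (hypothetical): human MOS and model predictions for five clips.
human_mos = [1.5, 2.0, 3.2, 4.1, 4.8]
predicted = [2.0, 2.1, 3.0, 3.6, 4.9]
print(round(pearson(human_mos, predicted), 3), round(spearman(human_mos, predicted), 3))
```

A model that only preserves the ordering of clips scores perfectly on the rank coefficient while still losing points on the linear one, which is why the two are often read together.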

What carries the argument

MOS-Bench, the dataset collection of eight training sets and seventeen test sets used to measure out-of-domain generalization and to compare pooled versus domain-aware training.

If this is right

  • SSQA models can be made more robust by collecting and pooling existing labeled datasets rather than designing new training objectives.
  • Increasing the number of distinct listening-test conditions in the training pool improves generalization more reliably than scaling the total number of utterances.
  • Domain-aware adaptation methods may be unnecessary when sufficient data diversity is already present through pooling.
  • Future SSQA papers should report performance on multiple held-out test sets to demonstrate generalization.
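
The second point above separates diversity from volume. One way to probe that distinction is to hold the utterance budget fixed and change only how many datasets the budget is drawn from. A minimal sketch under stated assumptions follows: the dataset names, the record layout, and the even-split sampling rule are all hypothetical and are not the paper's protocol.

```python
import random

def pool_with_budget(datasets, budget, seed=0):
    """Draw `budget` records spread as evenly as possible over `datasets`.

    `datasets` maps a dataset name to a list of (utterance_id, mos) pairs.
    Returns a shuffled list of (dataset_name, utterance_id, mos) records.
    """
    rng = random.Random(seed)
    names = sorted(datasets)
    base, extra = divmod(budget, len(names))
    pooled = []
    for i, name in enumerate(names):
        # The first `extra` datasets absorb the remainder of the split.
        quota = base + (1 if i < extra else 0)
        if quota > len(datasets[name]):
            raise ValueError(f"{name} has fewer than {quota} records")
        for utt_id, mos in rng.sample(datasets[name], quota):
            pooled.append((name, utt_id, mos))
    rng.shuffle(pooled)
    return pooled

# Hypothetical toy corpora with made-up scores.
toy = {
    "set_a": [(f"a{i}", 3.0) for i in range(100)],
    "set_b": [(f"b{i}", 2.5) for i in range(100)],
    "set_c": [(f"c{i}", 4.0) for i in range(100)],
}
single = pool_with_budget({"set_a": toy["set_a"]}, budget=60)
pooled = pool_with_budget(toy, budget=60)
print(len(single), len(pooled), sorted({name for name, _, _ in pooled}))
```

Training the same model on `single` and on `pooled`, then comparing held-out scores, would attribute any difference to the mix of sources rather than to the number of utterances.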

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pooling approach could be tested on other perceptual prediction tasks such as image quality or music preference where labeled data also exist in separate collections.
  • If variation is the key driver, then deliberately constructing training sets that cover more listener demographics or acoustic environments may yield further gains.
  • Model developers could prioritize releasing their training data under open licenses to enable larger pooled collections.

Load-bearing premise

The seventeen test sets capture the kinds of distribution shifts that actually occur when speech quality models are deployed in new environments.

What would settle it

A new listening test collected under conditions absent from all seventeen current test sets where the pooled model shows no improvement over single-dataset baselines.
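
Such a test comes down to comparing two models on the same new utterances. The sketch below uses a paired bootstrap over utterances to put an interval on the difference in linear correlation. The resampling scheme, the 95% level, and all names and numbers are assumptions for illustration, not a procedure taken from the paper.

```python
import numpy as np

def corr(x, y):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def paired_bootstrap_diff(human, pred_pooled, pred_single, n_boot=2000, seed=0):
    """95% bootstrap interval for corr(pooled) - corr(single).

    The same resampled utterance indices are used for both models,
    so the comparison stays paired.
    """
    human = np.asarray(human, dtype=float)
    pred_pooled = np.asarray(pred_pooled, dtype=float)
    pred_single = np.asarray(pred_single, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(human)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = corr(human[idx], pred_pooled[idx]) - corr(human[idx], pred_single[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)

# Synthetic stand-in for a new listening test (not real data).
rng = np.random.default_rng(1)
human = rng.uniform(1.0, 5.0, size=200)
pred_pooled = human + rng.normal(0.0, 0.4, size=200)  # less noisy predictor
pred_single = human + rng.normal(0.0, 1.2, size=200)  # noisier predictor
low, high = paired_bootstrap_diff(human, pred_pooled, pred_single)
print(f"95% interval for the correlation gain: [{low:.3f}, {high:.3f}]")
```

An interval that includes zero on a genuinely new condition would be the kind of evidence described above.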

Figures

Figures reproduced from arXiv: 2411.03715 by Erica Cooper, Tomoki Toda, Wen-Chin Huang.

Figure 1: Main models and inference methods supported in SHEET, the open-source toolkit developed.
Figure 2: Distribution plot of an SSL-MOS model trained on ...
Figure 3: Best score difference and best score ratio result for single dataset training experiments. For best score difference, the ...
Figure 4: SSL embedding visualization of SSQA models trained ...
Figure 6: Best score difference and best score ratio result for multiple datasets training experiments. Red boxes indicate a best ...
Figure 7: SSL embedding visualization of SSQA models trained ...
Figure 1: Raw scores of the single dataset training experiments.
Figure 2: SSL embedding visualization of SSQA models trained on one single dataset. For each subfigure, the right-hand side ...
Figure 3: Raw scores of the multiple dataset training experiments.
Original abstract

In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MOS-Bench, a benchmark collection of 8 training sets and 17 test sets for subjective speech quality assessment (SSQA). It demonstrates OOD generalization challenges in existing models via extensive experiments and shows that straightforward pooling of multiple training sets is a simple yet effective approach for robust generalization, with data variation mattering more than training data size alone; this is compared against AlignNet, a domain-aware baseline.

Significance. If the empirical results hold, the work is significant because it supplies a diverse, publicly usable benchmark for an underexplored but practically important problem in SSQA, where models are routinely used to evaluate speech generation systems. The finding that simple multi-dataset pooling outperforms or matches more complex domain-adaptation methods, together with the emphasis on data variation, supplies a concrete, immediately actionable recommendation and falsifiable predictions for future model training.

major comments (2)
  1. [Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.
  2. [Test-set construction (Section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.
minor comments (2)
  1. [Results figures and tables] Add error bars or confidence intervals to all tables and figures that compare pooling against AlignNet and single-dataset baselines.
  2. [Discussion of data-variation results] Clarify the exact definition of “variation in the data” (e.g., acoustic diversity metrics, speaker coverage, or perceptual-score distribution spread) when asserting it is more important than data size.
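
The second minor comment asks what "variation" means operationally. Purely as an illustration of the kinds of definitions the referee lists, the sketch below computes two candidate proxies: the Shannon entropy of the source-dataset labels in a training pool, and the standard deviation of its MOS labels. Neither is claimed to be the paper's definition, and the names and toy values are hypothetical.

```python
import math
from collections import Counter
from statistics import pstdev

def source_entropy(sources):
    """Shannon entropy (in bits) of the source labels in a training pool."""
    counts = Counter(sources)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def score_spread(mos_values):
    """Population standard deviation of the MOS labels."""
    return pstdev(mos_values)

# Hypothetical pools of equal size: one source versus four sources.
narrow_sources = ["set_a"] * 8
broad_sources = ["set_a", "set_b", "set_c", "set_d"] * 2
print(source_entropy(narrow_sources), source_entropy(broad_sources))
print(score_spread([3.0, 3.1, 2.9, 3.0]), score_spread([1.0, 2.5, 4.0, 5.0]))
```

Both pools here have eight records, so any difference in these proxies reflects the mix of sources or scores rather than pool size.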

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the two major comments below and will revise the manuscript accordingly to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.

    Authors: We agree that the abstract would benefit from additional detail to support independent verification of the claims. In the revision we will expand the abstract to briefly specify the model architectures evaluated, the precise criteria used to define OOD test sets, and the statistical procedures (including error bars) employed in the comparisons. Full experimental protocols, architectures, OOD definitions, and statistical results already appear in Sections 3–5; the abstract update will make these elements more immediately accessible. revision: yes

  2. Referee: [Test-set construction (Section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.

    Authors: We acknowledge that a more explicit justification of domain shift is necessary. In the revised manuscript we will add a dedicated subsection (or expanded appendix) that documents, for each of the 17 test sets: (i) collection protocols and source conditions, (ii) labeling procedures and any differences from training-set protocols, and (iii) acoustic and perceptual characteristics that differentiate them from the training distributions. This will strengthen the argument that the observed generalization gaps reflect genuine domain shifts. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical benchmark study with no equations, derivations, or parameter-fitting steps. All claims rest on direct experimental comparisons across provided training and test sets. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing elements in any derivation chain. The central result (pooling improves OOD performance) is presented as an observed outcome of the experiments rather than a constructed equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  2. Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

    cs.SD 2025-02 unverdicted novelty 6.0

    Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
