MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Erica Cooper; Tomoki Toda; Wen-Chin Huang

arxiv: 2411.03715 · v2 · submitted 2024-11-06 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-23 17:55 UTC · model grok-4.3

keywords subjective speech quality assessment · out-of-domain generalization · MOS prediction · data pooling · speech quality benchmark · multi-dataset training · perceptual quality modeling

The pith

Pooling multiple speech quality datasets yields better out-of-domain generalization than single-set or domain-aware training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current subjective speech quality assessment models often fail when applied to new recording conditions or listeners. The paper introduces MOS-Bench, a collection of 8 training sets and 17 test sets that span different languages, systems, and listening tests. Experiments show that simply combining the training sets improves prediction accuracy on the held-out test sets, while an existing domain-aware method does not add clear benefit. The authors further find that the diversity of the combined data matters more than simply increasing its total volume. This points to a practical route for building more reliable quality predictors without new model architectures.

Core claim

Existing SSQA models exhibit large performance drops on out-of-domain test sets. Training on the pooled collection of eight datasets produces higher correlation with human scores on the seventeen test sets than training on any single dataset or using AlignNet. Variation across the pooled data contributes to this gain beyond the effect of dataset size alone.
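As a rough illustration of what "correlation with human scores" on each held-out test set could look like in practice, here is a minimal sketch. The summary does not say which correlation coefficient the paper reports, so the use of Spearman rank correlation is an assumption, and `evaluate`, `predict`, and the `test_sets` layout are hypothetical names introduced only for this example.

```python
from math import sqrt


def pearson(x, y):
    """Linear correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(predict, test_sets):
    """Score one model on every test set separately.

    test_sets maps a set name to a list of (utterance, human_mos) pairs
    (a hypothetical layout); returns {name: rank correlation}.
    """
    results = {}
    for name, items in test_sets.items():
        preds = [predict(utt) for utt, _ in items]
        human = [mos for _, mos in items]
        results[name] = spearman(preds, human)
    return results
```

Reporting one number per test set, rather than a single pooled figure, keeps an in-domain success from masking an out-of-domain drop.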

What carries the argument

MOS-Bench, the dataset collection of eight training sets and seventeen test sets used to measure out-of-domain generalization and to compare pooled versus domain-aware training.

If this is right

SSQA models can be made more robust by collecting and pooling existing labeled datasets rather than designing new training objectives.
Increasing the number of distinct listening-test conditions in the training pool improves generalization more reliably than scaling the total number of utterances.
Domain-aware adaptation methods may be unnecessary when sufficient data diversity is already present through pooling.
Future SSQA papers should report performance on multiple held-out test sets to demonstrate generalization.
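One way to probe the variation-versus-size implication is to hold the total number of training utterances fixed while changing how many datasets they are drawn from. The sketch below shows that sampling step only; it is not the paper's protocol, and `pool_with_budget` and the dictionary layout are hypothetical.

```python
import random


def pool_with_budget(datasets, budget, seed=0):
    """Draw `budget` items spread as evenly as possible across datasets.

    datasets maps a dataset name to a list of items (hypothetical layout).
    Returns a shuffled list of (name, item) pairs of length `budget`.
    """
    rng = random.Random(seed)
    names = sorted(datasets)
    share, extra = divmod(budget, len(names))
    pooled = []
    for idx, name in enumerate(names):
        # The first `extra` datasets take one additional item each.
        quota = share + (1 if idx < extra else 0)
        if quota > len(datasets[name]):
            raise ValueError(f"{name} has fewer than {quota} items")
        pooled.extend((name, item) for item in rng.sample(datasets[name], quota))
    rng.shuffle(pooled)
    return pooled
```

Training one model on `pool_with_budget({"A": a}, 1000)` and another on `pool_with_budget({"A": a, "B": b, "C": c}, 1000)` would give two training sets of equal size that differ only in how many sources they cover.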

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pooling approach could be tested on other perceptual prediction tasks such as image quality or music preference where labeled data also exist in separate collections.
If variation is the key driver, then deliberately constructing training sets that cover more listener demographics or acoustic environments may yield further gains.
Model developers could prioritize releasing their training data under open licenses to enable larger pooled collections.

Load-bearing premise

The seventeen test sets capture the kinds of distribution shifts that actually occur when speech quality models are deployed in new environments.

What would settle it

A new listening test collected under conditions absent from all seventeen current test sets where the pooled model shows no improvement over single-dataset baselines.

Figures

Figures reproduced from arXiv: 2411.03715 by Erica Cooper, Tomoki Toda, Wen-Chin Huang.

**Figure 2.** Distribution plot of an SSL-MOS model trained on ...

**Figure 3.** Best score difference and best score ratio result for single dataset training experiments. For best score difference, the ...

**Figure 4.** SSL embedding visualization of SSQA models trained ...

**Figure 6.** Best score difference and best score ratio result for multiple datasets training experiments. Red boxes indicate a best ...

**Figure 7.** SSL embedding visualization of SSQA models trained ...

**Figure 1.** Raw scores of the single dataset training experiments.

**Figure 2.** SSL embedding visualization of SSQA models trained on one single dataset. For each subfigure, the right-hand side ...

**Figure 3.** Raw scores of the multiple dataset training experiments.

Original abstract

In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

MOS-Bench gives a concrete collection of 8 train and 17 test sets to expose OOD failures in SSQA models, and shows pooling beats AlignNet while variation matters more than size.

Colleague, the main point is that this paper assembles MOS-Bench with 8 training sets and 17 test sets to measure how SSQA models perform outside their training distribution. The experiments indicate that simply pooling the training data outperforms AlignNet, and that diversity across datasets drives better generalization than training set size alone. They also document clear OOD drops for existing models, which matches what people see when they try to reuse these predictors on new speech systems.

What the work does well is collect the datasets into one place and run the direct head-to-head on pooling versus the domain-aware baseline. That comparison is straightforward and addresses a gap that gets mentioned but rarely tested at this scale in the SSQA literature. The benchmark itself could serve as a shared resource for checking new models.

The soft spots sit in the experimental reporting. The abstract gives no model architectures, no exact rules for what counts as OOD, and no error bars or statistical tests, so the size of the reported gains is hard to judge from the summary alone. It is also unclear how much the 17 test sets reflect genuine deployment shifts versus differences in how the original datasets were recorded or labeled. Those details matter for deciding whether the variation-over-size conclusion generalizes.

This paper is aimed at researchers who build or apply SSQA models for speech synthesis evaluation. A reader who needs a practical way to test generalization would find the splits and the pooling result useful even if they want tighter controls. It deserves a serious referee because the benchmark construction and the empirical comparison are grounded enough to warrant external checks, though the authors will likely need to add the missing experimental specifics. I would send it for review with requests for those details and for clearer documentation of the dataset selection process.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MOS-Bench, a benchmark collection of 8 training sets and 17 test sets for subjective speech quality assessment (SSQA). It demonstrates OOD generalization challenges in existing models via extensive experiments and shows that straightforward pooling of multiple training sets is a simple yet effective approach for robust generalization, with data variation mattering more than training data size alone; this is compared against AlignNet, a domain-aware baseline.

Significance. If the empirical results hold, the work is significant because it supplies a diverse, publicly usable benchmark for an underexplored but practically important problem in SSQA, where models are routinely used to evaluate speech generation systems. The finding that simple multi-dataset pooling outperforms or matches more complex domain-adaptation methods, together with the emphasis on data variation, supplies a concrete, immediately actionable recommendation and falsifiable predictions for future model training.

major comments (2)

[Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.
[Test-set construction (Section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.

minor comments (2)

[Results figures and tables] Add error bars or confidence intervals to all tables and figures that compare pooling against AlignNet and single-dataset baselines.
[Discussion of data-variation results] Clarify the exact definition of “variation in the data” (e.g., acoustic diversity metrics, speaker coverage, or perceptual-score distribution spread) when asserting it is more important than data size.
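The request for error bars could be met in several ways. A minimal sketch of one common option, a percentile bootstrap confidence interval over utterances for a correlation score, is given below. The choice of Pearson correlation, the resample count, and the function names are assumptions made for illustration; nothing here is taken from the paper.

```python
import random
from math import sqrt


def pearson(x, y):
    """Linear correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def bootstrap_ci(preds, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of preds vs human."""
    if len(set(preds)) < 2 or len(set(human)) < 2:
        raise ValueError("correlation is undefined for constant input")
    rng = random.Random(seed)
    n = len(preds)
    stats = []
    while len(stats) < n_resamples:
        # Resample utterance indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [preds[i] for i in idx]
        ys = [human[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples with zero variance
        stats.append(pearson(xs, ys))
    stats.sort()
    lo = stats[round((alpha / 2) * n_resamples)]
    hi = stats[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the intervals for pooled training and for a single-dataset baseline overlap heavily on a given test set, the difference on that set is weak evidence either way.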

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the two major comments below and will revise the manuscript accordingly to improve clarity and verifiability.

Referee: [Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.

Authors: We agree that the abstract would benefit from additional detail to support independent verification of the claims. In the revision we will expand the abstract to briefly specify the model architectures evaluated, the precise criteria used to define OOD test sets, and the statistical procedures (including error bars) employed in the comparisons. Full experimental protocols, architectures, OOD definitions, and statistical results already appear in Sections 3–5; the abstract update will make these elements more immediately accessible.

Revision: yes

Referee: [Test-set construction (Section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.

Authors: We acknowledge that a more explicit justification of domain shift is necessary. In the revised manuscript we will add a dedicated subsection (or expanded appendix) that documents, for each of the 17 test sets: (i) collection protocols and source conditions, (ii) labeling procedures and any differences from training-set protocols, and (iii) acoustic and perceptual characteristics that differentiate them from the training distributions. This will strengthen the argument that the observed generalization gaps reflect genuine domain shifts.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical benchmark study with no equations, derivations, or parameter-fitting steps. All claims rest on direct experimental comparisons across provided training and test sets. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing elements in any derivation chain. The central result (pooling improves OOD performance) is presented as an observed outcome of the experiments rather than a constructed equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
cs.SD 2025-02 unverdicted novelty 6.0

Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.


[1] [1]

Recommendation p.800: Methods for subjec- tive determination of transmission quality,

ITUT Recommendation, “Recommendation p.800: Methods for subjec- tive determination of transmission quality,” International Telecommuni- cations Union—Radiocommunication (ITU-T) , 1998

work page 1998

[2] [2]

Speech Quality Estimation: Models and Trends,

S. M ¨oller, W.-Y . Chan, N. C ˆot´e, T. H. Falk, A. Raake, and M. W¨altermann, “Speech Quality Estimation: Models and Trends,”IEEE Signal Processing Magazine , vol. 28, no. 6, pp. 18–28, 2011

work page 2011

[3] [3]

A review on subjective and objective evaluation of synthetic speech,

E. Cooper, W.-C. Huang, Y . Tsao, H.-M. Wang, T. Toda, and J. Ya- magishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, vol. 45, no. 4, pp. 161–183, 2024

work page 2024

[4] [4]

Speech Synthesis Evaluation — State-of-the-Art Assessment and Sug- gestion for a Novel Research Program,

P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz, ´Eva Sz ´ekely, C. T ˚annander, and J. V oße, “Speech Synthesis Evaluation — State-of-the-Art Assessment and Sug- gestion for a Novel Research Program,” in Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10) , 2019, pp. 105–110

work page 2019

[5] [5]

SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging nlp evaluation metrics,

T. Saeki, S. Maiti, S. Takamichi, S. Watanabe, and H. Saruwatari, “SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging nlp evaluation metrics,” arXiv preprint arXiv:2401.16812, 2024

work page arXiv 2024

[6] [6]

The V oiceMOS Challenge 2022,

W.-C. Huang, E. Cooper, Y . Tsao, H.-M. Wang, T. Toda, and J. Yamag- ishi, “The V oiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4536–4540

work page 2022

[7] [7]

Generalization ability of MOS prediction networks,

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022, pp. 8442– 8446

work page 2022

[8] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2023: Zero-Shot Subjective Speech Quality Prediction for Multiple Domains,” in Proc. ASRU, 2023, pp. 1–7

[9] A. Mohamed, H.-y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff, S.-W. Li, K. Livescu, L. Maaløe et al., “Self-supervised speech representation learning: A review,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, 2022

[10] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020

[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021

[12] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[13] W.-C. Tseng, C.-y. Huang, W.-T. Kao, Y. Y. Lin, and H.-y. Lee, “Utilizing self-supervised representations for MOS prediction,” in Proc. Interspeech, 2021, pp. 2781–2785

[14] T. Sellam, A. Bapna, J. Camp, D. Mackinnon, A. P. Parikh, and J. Riesa, “SQuId: Measuring speech naturalness in many languages,” in Proc. ICASSP, 2023, pp. 1–5

[15] R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features,” IEEE/ACM TASLP, vol. 31, pp. 54–70, 2023

[16] K. El Hajal, Z. Wu, N. Scheidwasser-Clow, G. Elbanna, and M. Cernak, “Efficient Speech Quality Assessment Using Self-Supervised Framewise Embeddings,” in Proc. ICASSP, 2023, pp. 1–5

[17] S. Maiti, Y. Peng, T. Saeki, and S. Watanabe, “SpeechLMScore: evaluating speech generation using speech language model,” in Proc. ICASSP, 2023, pp. 1–5

[18] S.-W. Fu, K.-H. Hung, Y. Tsao, and Y.-C. F. Wang, “Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech,” in Proc. ICLR, 2024. [Online]. Available: https://openreview.net/forum?id=ale56Ya59q

[19] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525

[20] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992

[21] W.-C. Huang, E. Cooper, J. Yamagishi, and T. Toda, “LDNet: unified listener dependent modeling in MOS prediction for synthetic speech,” in Proc. ICASSP, 2022, pp. 896–900

[22] Z. Qi, X. Hu, W. Zhou, S. Li, H. Wu, J. Lu, and X. Xu, “LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement,” in Proc. ASRU, 2023, pp. 1–6

[23] ITU-T Recommendation, “Recommendation P.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models,” International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), 2001

[24] S. Le Maguer, S. King, and N. Harte, “Back to the future: Extending the Blizzard Challenge 2013,” in Proc. Interspeech, 2022, pp. 2378–2382

[25] E. Cooper and J. Yamagishi, “Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech,” in Proc. Interspeech, 2023, pp. 1104–1108

[26] G. Mittag, S. Zadtootaghaj, T. Michael, B. Naderi, and S. Möller, “Bias-aware loss for training image and speech quality prediction models from multiple datasets,” in International Conference on Quality of Multimedia Experience (QoMEX), 2021, pp. 97–102

[27] Y. Leng, X. Tan, S. Zhao, F. Soong, X.-Y. Li, and T. Qin, “MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network,” in Proc. ICASSP, 2021, pp. 391–395

[28] J. Pieper and S. Voran, “AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators,” in Proc. Interspeech, 2024, pp. 82–86

[29] E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” in Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 2021, pp. 183–188

[30] G. Maniati, A. Vioni, N. Ellinas, K. Nikitaras, K. Klapsas, J. S. Sung, G. Jho, A. Chalamandaris, and P. Tsiakoulis, “SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis,” in Proc. Interspeech, 2022, pp. 2388–2392

[31] Y. Tang, J. Shi, Y. Wu, and Q. Jin, “SingMOS: An extensive open-source singing voice dataset for MOS prediction,” arXiv preprint arXiv:2406.10911, 2024

[32] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech, 2021, pp. 2127–2131

[33] Y.-W. Chen and Y. Tsao, “InQSS: a speech intelligibility and quality assessment model using a multi-task learning network,” in Proc. Interspeech, 2022, pp. 3088–3092

[34] G. Yi, W. Xiao, Y. Xiao, B. Naderi, S. Möller, W. Wardah, G. Mittag, R. Cutler, Z. Zhang, D. S. Williamson, F. Chen, F. Yang, and S. Shang, “ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications,” in Proc. Interspeech, 2022, pp. 3308–3312

[35] G. Mittag, R. Cutler, Y. Hosseinkashi, M. Revow, S. Srinivasan, N. Chande, and R. Aichner, “DNN No-Reference PSTN Speech Quality Prediction,” in Proc. Interspeech, 2020, pp. 2867–2871

[36] Z. Wu, Z. Xie, and S. King, “The Blizzard Challenge 2019,” in Proc. Blizzard Challenge Workshop, vol. 2019, 2019

[37] O. Perrotin, B. Stephenson, S. Gerber, and G. Bailly, “The Blizzard Challenge 2023,” in Proc. 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023

[38] W.-C. Huang, L. P. Violeta, S. Liu, J. Shi, and T. Toda, “The Singing Voice Conversion Challenge 2023,” in Proc. ASRU, 2023, pp. 1–8

[39] R. E. Zezario, Y.-W. Chen, S.-W. Fu, Y. Tsao, H.-M. Wang, and C.-S. Fuh, “A study on incorporating Whisper for robust speech assessment,” in Proc. ICME, 2024

[40] W.-C. Huang, S.-W. Fu, E. Cooper, R. Zezario, T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao, “The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction,” in Proc. SLT, 2024

[41] T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan, “ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit,” in Proc. ICASSP, 2020, pp. 7654–7658

[42] K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017

[43] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019, pp. 5891–5895

[44] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi Speech Recognition Toolkit,” in Proc. ASRU, 2011

[45] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-End Speech Processing Toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211

[46] H. Wang, S. Zhao, X. Zheng, and Y. Qin, “RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting,” in Proc. Interspeech, 2023, pp. 1095–1099

[47] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, “MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion,” in Proc. Interspeech, 2019, pp. 1541–1545

[48] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008

Supplementary Materials for: MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Wen-Chin Huang, Member, IEEE, Erica Cooper, Member, IEEE, and Tomoki Toda, Member, IEEE

I. ADDITIONAL...