MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
Pith reviewed 2026-05-23 17:55 UTC · model grok-4.3
The pith
Pooling multiple speech quality datasets yields better out-of-domain generalization than single-set or domain-aware training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing SSQA models exhibit large performance drops on out-of-domain test sets. Training on the pooled collection of eight datasets produces higher correlation with human scores on the seventeen test sets than training on any single dataset or using AlignNet. Variation across the pooled data contributes to this gain beyond the effect of dataset size alone.
What carries the argument
MOS-Bench, the dataset collection of eight training sets and seventeen test sets used to measure out-of-domain generalization and to compare pooled versus domain-aware training.
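The pooled-versus-single comparison, and the separation of variation from size, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(utterance_id, mos)`, the function names `pool_datasets` and `size_matched_subsample`, and the round-robin sampling rule are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Hypothetical record: (utterance_id, mos). Real sets would also carry audio paths.
Sample = tuple[str, float]


def pool_datasets(datasets: dict[str, list[Sample]]) -> list[tuple[str, str, float]]:
    """Concatenate all training sets, tagging each sample with its source name."""
    pooled = []
    for name, samples in datasets.items():
        for utt_id, mos in samples:
            pooled.append((name, utt_id, mos))
    return pooled


def size_matched_subsample(pooled, target_size, seed=0):
    """Draw up to target_size samples spread as evenly as possible across sources.

    Keeping the pool's variety while matching the size of a single dataset
    lets size and variation be compared separately (an assumed protocol).
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for record in pooled:
        by_source[record[0]].append(record)
    sources = sorted(by_source)
    for name in sources:
        rng.shuffle(by_source[name])
    picked = []
    index = 0
    # Round-robin over sources until the budget is spent or the data runs out.
    while len(picked) < target_size and any(by_source[s] for s in sources):
        name = sources[index % len(sources)]
        if by_source[name]:
            picked.append(by_source[name].pop())
        index += 1
    return picked


if __name__ == "__main__":
    toy = {
        "set_a": [(f"a{i}", 3.0) for i in range(6)],
        "set_b": [(f"b{i}", 4.0) for i in range(4)],
        "set_c": [(f"c{i}", 2.0) for i in range(2)],
    }
    pooled = pool_datasets(toy)
    subset = size_matched_subsample(pooled, target_size=6)
    print(len(pooled), len(subset), sorted({r[0] for r in subset}))
```

Training one model on `subset` and another on a single dataset of the same size would hold the sample count fixed, so any remaining gap could be attributed to variety rather than volume.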
If this is right
- SSQA models can be made more robust by collecting and pooling existing labeled datasets rather than designing new training objectives.
- Increasing the number of distinct listening-test conditions in the training pool improves generalization more reliably than scaling the total number of utterances.
- Domain-aware adaptation methods may be unnecessary when sufficient data diversity is already present through pooling.
- Future SSQA papers should report performance on multiple held-out test sets to demonstrate generalization.
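Reporting on several held-out test sets, as the last point suggests, amounts to computing a correlation per test set rather than one aggregate number. The sketch below uses linear (Pearson) and rank (Spearman) correlation in plain Python; the choice of these two metrics and the names `report`, `lcc`, and `srcc` are assumptions for illustration, since this page only says "correlation with human scores".

```python
from math import sqrt


def pearson(x, y):
    """Linear correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def report(per_test_set):
    """per_test_set maps a test-set name to (human_mos, predicted_mos) lists."""
    rows = {}
    for name, (human, predicted) in per_test_set.items():
        rows[name] = {
            "lcc": pearson(human, predicted),
            "srcc": spearman(human, predicted),
        }
    return rows


if __name__ == "__main__":
    scores = {
        "test_in_domain": ([1.0, 2.0, 3.0, 4.0], [1.1, 2.2, 2.9, 4.3]),
        "test_out_of_domain": ([1.0, 2.0, 3.0, 4.0], [2.5, 2.4, 3.1, 2.9]),
    }
    for name, row in report(scores).items():
        print(name, round(row["lcc"], 3), round(row["srcc"], 3))
```

Keeping one row per test set makes a drop on any single out-of-domain set visible instead of averaging it away.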
Where Pith is reading between the lines
- The same pooling approach could be tested on other perceptual prediction tasks such as image quality or music preference where labeled data also exist in separate collections.
- If variation is the key driver, then deliberately constructing training sets that cover more listener demographics or acoustic environments may yield further gains.
- Model developers could prioritize releasing their training data under open licenses to enable larger pooled collections.
Load-bearing premise
The seventeen test sets capture the kinds of distribution shifts that actually occur when speech quality models are deployed in new environments.
What would settle it
A new listening test collected under conditions absent from all seventeen current test sets where the pooled model shows no improvement over single-dataset baselines.
Original abstract
In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MOS-Bench, a benchmark collection of 8 training sets and 17 test sets for subjective speech quality assessment (SSQA). Through extensive experiments, it demonstrates the OOD generalization challenges of existing models and compares straightforward pooling of multiple training sets against AlignNet, a domain-aware baseline. It concludes that pooling is a simple yet effective approach for robust generalization, and that variation in the data matters beyond training data size alone.
Significance. If the empirical results hold, the work is significant because it supplies a diverse, publicly usable benchmark for an underexplored but practically important problem in SSQA, where models are routinely used to evaluate speech generation systems. The finding that simple multi-dataset pooling outperforms or matches more complex domain-adaptation methods, together with the emphasis on data variation, supplies a concrete, immediately actionable recommendation and falsifiable predictions for future model training.
major comments (2)
- [Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.
- [Test-set construction (section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.
minor comments (2)
- [Results figures and tables] Add error bars or confidence intervals to all tables and figures that compare pooling against AlignNet and single-dataset baselines.
- [Discussion of data-variation results] Clarify the exact definition of “variation in the data” (e.g., acoustic diversity metrics, speaker coverage, or perceptual-score distribution spread) when asserting it is more important than data size.
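One common way to obtain the confidence intervals requested in the first minor comment is a percentile bootstrap over utterances. The sketch below is an assumption about how that could be done, not a description of the paper's procedure; `bootstrap_ci`, its defaults, and the use of Pearson correlation are illustrative choices.

```python
import random
from math import sqrt


def pearson(x, y):
    """Linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def bootstrap_ci(human, predicted, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation (assumed procedure)."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping score pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [human[i] for i in idx]
        ys = [predicted[i] for i in idx]
        # Skip degenerate resamples where one side has no variance.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        stats.append(pearson(xs, ys))
    if not stats:
        raise ValueError("every resample was degenerate; need more varied data")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    human = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 3.0]
    predicted = [1.2, 1.9, 3.4, 3.8, 4.6, 2.5, 4.1, 2.7]
    low, high = bootstrap_ci(human, predicted)
    print(f"LCC = {pearson(human, predicted):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

If the intervals for pooling and for a baseline on the same test set overlap heavily, the table entry alone would not justify calling one method better.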
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We address the two major comments below and will revise the manuscript accordingly to improve clarity and verifiability.
- Referee: [Abstract and experimental-setup section] The central claim that pooling improves OOD generalization and that variation (not size) is the key factor rests on comparisons whose details (model architectures, precise OOD definitions, statistical tests, and error bars) are not supplied in the abstract and are only alluded to in the high-level description of “extensive experiments.” Without these, the reported superiority of pooling cannot be independently verified.
  Authors: We agree that the abstract would benefit from additional detail to support independent verification of the claims. In the revision we will expand the abstract to briefly specify the model architectures evaluated, the precise criteria used to define OOD test sets, and the statistical procedures (including error bars) employed in the comparisons. Full experimental protocols, architectures, OOD definitions, and statistical results already appear in Sections 3–5; the abstract update will make these elements more immediately accessible. Revision: yes.
- Referee: [Test-set construction (section describing the 17 test sets)] The claim that the 17 test sets constitute genuine out-of-domain shifts representative of real deployment variations is load-bearing for all generalization conclusions. The manuscript must explicitly document collection protocols, labeling differences, and acoustic or perceptual mismatches that distinguish these sets from the training distributions; absent such justification, the observed gains could be artifacts of dataset curation rather than true domain shift.
  Authors: We acknowledge that a more explicit justification of domain shift is necessary. In the revised manuscript we will add a dedicated subsection (or expanded appendix) that documents, for each of the 17 test sets: (i) collection protocols and source conditions, (ii) labeling procedures and any differences from training-set protocols, and (iii) acoustic and perceptual characteristics that differentiate them from the training distributions. This will strengthen the argument that the observed generalization gaps reflect genuine domain shifts. Revision: yes.
Circularity Check
No significant circularity
The paper is an empirical benchmark study with no equations, derivations, or parameter-fitting steps. All claims rest on direct experimental comparisons across provided training and test sets. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing elements in any derivation chain. The central result (pooling improves OOD performance) is presented as an observed outcome of the experiments rather than a constructed equivalence.
Forward citations
Cited by 2 Pith papers
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
  Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.