Most MOS prediction models track human ratings under acoustic degradation, but all are insensitive to prosodic errors. On speaker characteristics they show a double dissociation: strong mean F0 biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice.
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
1 Pith paper cites this work.
abstract
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations of speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS ratings for the perturbed samples from human listeners and MOS predictions from the models, and analyzed how the two differ. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large drops in subjective scores. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight the limitations of scalar MOS prediction beyond acoustic fidelity.
fields
eess.AS (1)
years
2026 (1)
verdicts
UNVERDICTED (1)
representative citing papers
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
Most MOS prediction models track human ratings under acoustic degradation, but all are insensitive to prosodic errors. On speaker characteristics they show a double dissociation: strong mean F0 biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice.