pith. machine review for the scientific record.

arXiv: 1611.09207 · v1 · submitted 2016-11-28 · cs.CL · cs.LG · stat.ML

Recognition: unknown

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

classification: cs.CL · cs.LG · stat.ML
keywords: human, automos, raters, speech, correlations, model, quality, synthesized

Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
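The abstract compares predicted and human scores at two granularities: per utterance, and after averaging over several utterances. The following is a minimal sketch of that comparison, not the authors' evaluation code. The toy scores, the grouping of utterances by synthesizer, and all function names are assumptions made for illustration; it uses only the Python standard library.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_by_group(groups, scores):
    """Average the scores within each group label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for g, s in zip(groups, scores):
        sums[g] += s
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical data: (synthesizer id, human MOS, model-predicted MOS).
rows = [
    ("A", 4.2, 3.9), ("A", 3.8, 4.1), ("A", 4.0, 3.7),
    ("B", 3.1, 3.4), ("B", 2.9, 2.8), ("B", 3.3, 3.0),
    ("C", 2.0, 2.5), ("C", 2.4, 2.1), ("C", 2.2, 2.6),
]
systems = [r[0] for r in rows]
human = [r[1] for r in rows]
predicted = [r[2] for r in rows]

# Utterance level: one (human, predicted) pair per utterance.
utt_pearson = pearson(human, predicted)
utt_spearman = spearman(human, predicted)

# Aggregated level: average each synthesizer's utterances, then correlate.
human_means = mean_by_group(systems, human)
pred_means = mean_by_group(systems, predicted)
labels = sorted(human_means)
sys_pearson = pearson([human_means[s] for s in labels],
                      [pred_means[s] for s in labels])
sys_spearman = spearman([human_means[s] for s in labels],
                        [pred_means[s] for s in labels])

print(f"utterance level: r={utt_pearson:.3f}, rho={utt_spearman:.3f}")
print(f"averaged level:  r={sys_pearson:.3f}, rho={sys_spearman:.3f}")
```

In this toy data the per-utterance errors partly cancel when averaged, so the aggregated correlation is higher than the utterance-level one. That is the qualitative effect the abstract describes; the specific numbers here carry no meaning.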

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

    cs.SD · 2025-02 · unverdicted · novelty 6.0

    Unified no-reference models assess audio aesthetics across speech, music, and sound along four perceptual axes and achieve performance comparable to or better than human mean opinion scores.