MOSNet: Deep Learning Based Objective Assessment for Voice Conversion
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (3)
citing papers explorer
- Neural networks for Text-to-Speech evaluation
  NeuralSBS reaches 73.7% accuracy on side-by-side TTS comparisons and enhanced MOS models reach RMSE 0.40, beating the human inter-rater baseline of 0.62.
- Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
  Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.
- A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
  A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.