Partial mix-up on clean-degraded speech pairs plus contrastive loss produces frame-level embeddings that cluster by degradation type and improve detection and classification on in- and out-of-domain data.
Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals
abstract
Automatic subjective speech quality assessment (SSQA) traditionally estimates speech quality on an utterance or system level. While this resolution was adequate for older transmission or synthesis systems that produced speech signals of mediocre quality, modern systems generate high-quality speech with degradations that may occur only locally. With suitable model architectures and regularization losses, SSQA models trained with utterance-level targets can also yield useful local predictions of speech quality. In this work, we extend such models to produce frame-level embeddings that cluster by degradation type. Specifically, we employ a partial mix-up strategy on a parallel corpus of clean and degraded utterances and apply a contrastive loss to distinguish between degradation types. Through experiments on both in- and out-of-domain data, we demonstrate that our approach improves degradation detection and enables the identification of degradation types by analyzing embedding clusters.
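The abstract names two ingredients, partial mix-up on clean/degraded pairs and a contrastive loss over degradation types, without giving details here. The following is only a minimal NumPy sketch of how such a pairing could look, not the authors' implementation. It assumes time-aligned clean and degraded feature sequences of shape (frames, features), a single contiguous replaced segment, and a supervised contrastive loss in the usual temperature-scaled form. The function names `partial_mixup` and `supervised_contrastive_loss`, the label convention (0 = clean), and the temperature value are all hypothetical.

```python
import numpy as np


def partial_mixup(clean, degraded, start, length, degradation_id):
    """Replace frames [start, start + length) of `clean` with the
    time-aligned frames of `degraded` (hypothetical helper).

    Returns the mixed sequence and per-frame labels, where 0 means
    clean and `degradation_id` marks the replaced frames.
    """
    assert clean.shape == degraded.shape
    end = min(start + length, clean.shape[0])
    mixed = clean.copy()
    mixed[start:end] = degraded[start:end]
    labels = np.zeros(clean.shape[0], dtype=np.int64)
    labels[start:end] = degradation_id
    return mixed, labels


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over frame embeddings (assumed form).

    Frames sharing a label are positives; all other frames act as
    negatives. Anchors without any positive are skipped.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    z = embeddings / np.maximum(norms, 1e-12)  # unit-length embeddings
    sim = z @ z.T / temperature
    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-similarity

    # Numerically stable log-softmax over all other frames.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm

    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / n_pos[valid]
    return float(per_anchor.mean())


# Toy usage: 10 frames, 4 features, degradation type 2 on frames 3..6.
clean = np.zeros((10, 4))
degraded = np.ones((10, 4))
mixed, frame_labels = partial_mixup(clean, degraded, start=3, length=4,
                                    degradation_id=2)
```

In this sketch the per-frame labels produced by the mix-up step would serve as the grouping signal for the contrastive loss, so that frames from the same degradation type are pulled together and frames from different types (including clean frames) are pushed apart.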
fields
eess.AS
years
2026