Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

Michael Kuhlmann; Reinhold Haeb-Umbach; Tobias Cord-Landwehr

arxiv: 2605.21332 · v1 · pith:I5C6SDWQnew · submitted 2026-05-20 · 📡 eess.AS


Pith reviewed 2026-05-21 03:24 UTC · model grok-4.3

classification 📡 eess.AS

keywords speech quality assessment · degradation detection · frame-level embeddings · contrastive loss · partial mix-up · speech signals · local quality prediction


The pith

Frame-level embeddings from speech quality models cluster by degradation type when trained with partial mix-up and contrastive loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern speech systems produce mostly high-quality audio with degradations that appear only in short segments, so utterance-level quality scores miss important local details. This work extends standard speech quality assessment models to output frame-level embeddings instead. It does so by mixing clean and degraded versions of the same utterances in a controlled way and adding a contrastive loss that separates different degradation types in embedding space. Experiments show the resulting embeddings improve detection of where degradations occur and allow the types of degradation to be read off from how the embeddings group together, and this holds for both familiar and new data.

Core claim

Speech quality assessment models trained with a partial mix-up strategy on parallel clean and degraded utterances together with a contrastive loss produce frame-level embeddings that form clusters corresponding to distinct degradation types. These clusters improve degradation detection and enable identification of degradation types by analysis of the embedding space on both in-domain and out-of-domain data.
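The partial mix-up idea can be illustrated with a small sketch. This page does not give the paper's exact mixing procedure, so the version below makes loud assumptions: a single contiguous, frame-aligned segment of the clean signal is replaced outright by the parallel degraded signal, and each frame receives a binary pseudo-label. The function name `partial_mixup` and the parameter `frame_len` are hypothetical.

```python
import random


def partial_mixup(clean, degraded, frame_len, rng):
    """Replace one random frame-aligned segment of `clean` with `degraded`.

    Returns the mixed signal and one 0/1 pseudo-label per frame
    (1 = the frame was taken from the degraded signal).
    """
    if len(clean) != len(degraded):
        raise ValueError("parallel signals must have equal length")
    n_frames = len(clean) // frame_len
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    # Assumption: exactly one contiguous degraded segment per utterance.
    start = rng.randrange(n_frames)
    end = rng.randrange(start + 1, n_frames + 1)  # exclusive, >= 1 frame long
    lo, hi = start * frame_len, end * frame_len
    # Samples outside [lo, hi) stay clean, including any partial last frame.
    mixed = list(clean[:lo]) + list(degraded[lo:hi]) + list(clean[hi:])
    labels = [1 if start <= f < end else 0 for f in range(n_frames)]
    return mixed, labels


if __name__ == "__main__":
    rng = random.Random(0)
    mixed, labels = partial_mixup([0.0] * 16, [1.0] * 16, frame_len=4, rng=rng)
    print(labels)
```

Because the clean and degraded signals come from the same utterance, speaker and content are identical on both sides of each segment boundary; only the degradation changes, which is what lets the frame labels serve as pseudo-targets.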

What carries the argument

Frame-level embeddings produced by a speech quality model after partial mix-up on clean-degraded utterance pairs and contrastive loss to separate degradation types.
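A minimal sketch of a supervised contrastive loss over frame embeddings may help make the mechanism concrete. It follows the generic formulation of Khosla et al. ("Supervised contrastive learning"), not necessarily the paper's exact variant: embeddings are L2-normalized, and frames sharing a degradation label are treated as positives. The function name `supcon_loss`, the temperature value, and the toy data are assumptions for illustration.

```python
import math


def l2_normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled cosine similarity of the anchor to every other frame.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supcon_loss(frames, [0, 0, 1, 1]))  # labels agree with geometry
    print(supcon_loss(frames, [0, 1, 0, 1]))  # labels cut across geometry
```

The loss is small when frames with the same label already sit close together and large when the labels cut across the geometry, which is the pressure that pushes embeddings to group by degradation type.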

If this is right

Local frame-level predictions become available for spotting degradations that affect only parts of an utterance.
Degradation detection accuracy rises on both data seen during training and new out-of-domain recordings.
Specific degradation types can be identified simply by inspecting which cluster an embedding belongs to.
The approach extends older utterance-level quality assessment to the localized problems typical of current high-quality systems.
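Reading a degradation type off the embedding space can be sketched as nearest-centroid classification. This is a stand-in: the page does not describe the paper's actual cluster analysis. The helper names `fit_centroids` and `identify`, the use of cosine similarity, and the label strings are hypothetical.

```python
import math


def cosine(u, v):
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def fit_centroids(embeddings, labels):
    """Average the labeled frame embeddings per degradation type."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def identify(frame_embedding, centroids):
    """Return the label whose centroid is most similar to the frame."""
    return max(centroids, key=lambda lab: cosine(frame_embedding, centroids[lab]))


if __name__ == "__main__":
    train = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
    types = ["noise", "noise", "clipping", "clipping"]  # hypothetical labels
    centroids = fit_centroids(train, types)
    print(identify([0.9, 0.3], centroids))
```

Nearest-centroid assignment only works if the clusters are compact and well separated, so its accuracy is itself a rough check on the clustering claim.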

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These embeddings could feed into targeted enhancement algorithms that apply different fixes depending on the detected degradation cluster.
Similar mix-up and contrastive training might transfer to non-speech audio such as music or environmental sounds for degradation classification.
The clusters could be monitored in real time within communication pipelines to flag emerging quality issues before they affect listeners.

Load-bearing premise

That the partial mix-up and contrastive loss will cause embeddings to group by degradation type rather than by speaker identity or speech content.

What would settle it

If the learned embeddings in experiments cluster primarily by speaker or phonetic content instead of by degradation type, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2605.21332 by Michael Kuhlmann, Reinhold Haeb-Umbach, Tobias Cord-Landwehr.

**Figure 1.** Block schema of the full proposed model. The encoder Enc feeds two decoder heads: DecMOS for frame-level scores q, followed by mean pooling for utterance-level estimates ŷ, and Dec gscl for frame-level embeddings Z scl, followed by a projection layer for contrastive training. Z MOS denotes the embeddings of the MOS decoder before projection to the frame-level scores. where IP (cb,l) = {(b ′ , l′ ) ∈ {1, .…

**Figure 2.** CON1 MOS-based (upper) and embedding-based (lower) detection results for a single utterance from NISQA TEST SIM-partial-mixup. The threshold is tuned to maximize the intersection-based F1-score on the full test set. EER: 4.88% versus 3.87%). 6.3. Embedding analysis To analyze whether frame-level embeddings of the same degradation type group together and are distinguishable from embeddings of other degrad…
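MOS-based detection, as described in the paper excerpts on this page, compares each frame-level score to a tuned threshold. The sketch below shows that thresholding step and the mean pooling that turns frame scores into an utterance-level estimate. It assumes that a score below the threshold marks a degraded frame; the names `detect_segments`, `utterance_score`, and `q_deg` are illustrative, and threshold tuning against the intersection-based F1-score is not shown.

```python
def detect_segments(frame_scores, q_deg):
    """Return (start, end) frame-index pairs, end exclusive, where score < q_deg."""
    segments, start = [], None
    for i, q in enumerate(frame_scores):
        if q < q_deg and start is None:
            start = i  # a degraded run begins
        elif q >= q_deg and start is not None:
            segments.append((start, i))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_scores)))  # run reaches the end
    return segments


def utterance_score(frame_scores):
    """Mean pooling of frame-level scores into one utterance-level estimate."""
    return sum(frame_scores) / len(frame_scores)


if __name__ == "__main__":
    scores = [4.5, 4.4, 2.0, 1.8, 4.6, 2.5]
    print(detect_segments(scores, q_deg=3.0))
    print(utterance_score(scores))
```

The example also shows why utterance-level scores hide local problems: a short low-quality run barely moves the mean, yet it is obvious in the frame-level sequence.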

**Figure 3.** EER and accuracy when constraining the number of concurrent degradations in NISQA TEST SIM-partial-mixup. (CON1 EER: 7.71%, CON2 EER: 0.93%). Here, it becomes evident that excluding clean frames from the set of positive classes during training (CON2) is clearly beneficial for degradation identification. When increasing the maximum number of simultaneous degradations, accuracy starts to drop and EER gets h…
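The equal error rate (EER) quoted in the captions is the operating point where the false-acceptance and false-rejection rates coincide. This page does not say how the paper forms its trials or scores, so the sketch below is only the generic threshold sweep over a set of scored binary trials; the function name `equal_error_rate` and the convention that a higher score means "positive" are assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: higher means more likely positive. labels: 1 = positive, 0 = negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative trial")
    best_gap, eer = None, None
    for thr in sorted(set(scores)) + [float("inf")]:
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        far, frr = false_accepts / n_neg, misses / n_pos
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    print(equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

With finitely many trials the two rates rarely match exactly, so this version reports their midpoint at the threshold where they are closest; interpolating between thresholds is a common alternative.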


Automatic subjective speech quality assessment (SSQA) traditionally estimates speech quality on an utterance or system level. While this resolution was adequate for older transmission or synthesis systems that produced speech signals of mediocre quality, modern systems generate high-quality speech with degradations that may occur only locally. With suitable model architectures and regularization losses, SSQA models trained with utterance-level targets can also yield useful local predictions of speech quality. In this work, we extend such models to produce frame-level embeddings that cluster by degradation type. Specifically, we employ a partial mix-up strategy on a parallel corpus of clean and degraded utterances and apply a contrastive loss to distinguish between degradation types. Through experiments on both in- and out-of-domain data, we demonstrate that our approach improves degradation detection and enables the identification of degradation types by analyzing embedding clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They extend SSQA models with partial mix-up and contrastive loss to produce frame-level embeddings that cluster by degradation type, but the abstract leaves the disentanglement from speaker and content unaddressed.


The key takeaway is that the authors extend speech quality models to the frame level by using partial mix-up on clean-degraded pairs and a contrastive loss, so that the embeddings cluster by degradation type. This supports local detection and type identification, which matters for today's high-quality speech systems.

They do well in showing results on both in- and out-of-domain data, and the approach builds logically on prior utterance-level SSQA work. The partial mix-up idea is a solid way to create the training signal needed to distinguish degradation types without requiring new architectures. Credit to them for testing out-of-domain as well, which adds some robustness to the claims.

The main soft spot is the lack of clear evidence that the clusters reflect degradation rather than speaker or phonetic content. Judging from the abstract alone, the stress-test concern stands: without controls that fix speaker and content while changing degradation, the clustering could be driven by those confounders. The abstract claims the improvements but does not describe the exact loss formulations or verification steps, so the support for the claims is not yet fully convincing. This is a moderate issue rather than a fatal one, but it needs checking against the full experiments.

This is for speech processing folks who need local quality analysis rather than whole-utterance scores. A reader working on practical SSQA enhancements would get value from the method and the reported gains. It deserves a serious referee because the idea is testable and the experiments cover in- and out-of-domain cases. I recommend sending it out for peer review, with the referee asked to look closely at how the contrastive pairs are constructed.

Referee Report

2 major / 2 minor

Summary. The paper proposes extending utterance-level speech quality assessment models to produce frame-level embeddings by applying partial mix-up on parallel clean/degraded utterance pairs together with a contrastive loss. The central claim is that the resulting embeddings form clusters corresponding to distinct degradation types, which in turn improves degradation detection and enables type identification. Experiments are reported on both in-domain and out-of-domain data, with supporting visualizations of embedding clusters.

Significance. If the embeddings can be shown to isolate degradation type from speaker and content factors, the approach would meaningfully advance local SSQA for modern high-quality speech systems where degradations are often localized. The combination of partial mix-up and contrastive loss on parallel data is a plausible mechanism, and the in-/out-of-domain evaluation design is appropriate for testing generalization.

major comments (2)

[§3.2] (Contrastive Loss and Partial Mix-up): The loss formulation and pairing strategy are not described with explicit controls that hold speaker identity or phonetic content fixed while varying degradation (or vice versa). Without such controls or corresponding ablations, the t-SNE and k-means results cannot rule out the possibility that clusters are driven by speaker or content rather than degradation type. This directly undermines the claim that clusters enable degradation-type identification.
[§4.3] (Out-of-domain Experiments): The reported improvements in detection are given, but no quantitative cluster-purity or normalized mutual information metrics are provided that compare alignment with degradation labels versus speaker or utterance-content labels. This leaves the type-identification claim without direct evidence on the out-of-domain data where the concern is most acute.

minor comments (2)

[§1] The abstract and §1 use the term 'partial mix-up' without an immediate reference to its precise definition or the mixing ratio schedule; a short equation or pseudocode would improve clarity.
[Figure 3] (t-SNE visualizations): Color coding by degradation type is helpful, but adding a second panel colored by speaker ID would allow readers to visually assess the disentanglement claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the concerns about explicit controls in the pairing strategy and the lack of quantitative cluster metrics on out-of-domain data. Revisions will be made to strengthen the manuscript.


Referee: [§3.2] (Contrastive Loss and Partial Mix-up): The loss formulation and pairing strategy are not described with explicit controls that hold speaker identity or phonetic content fixed while varying degradation (or vice versa). Without such controls or corresponding ablations, the t-SNE and k-means results cannot rule out the possibility that clusters are driven by speaker or content rather than degradation type. This directly undermines the claim that clusters enable degradation-type identification.

Authors: The parallel corpus consists of clean and degraded versions of identical utterances, which by design holds speaker identity and phonetic content fixed while varying only the degradation. We will revise §3.2 to explicitly describe this pairing strategy and the resulting controls. We will also add an ablation comparing same-utterance pairs against cross-speaker pairs to confirm that degradation type, rather than speaker or content, drives the observed clustering. revision: yes

Referee: [§4.3] (Out-of-domain Experiments): The reported improvements in detection are given, but no quantitative cluster-purity or normalized mutual information metrics are provided that compare alignment with degradation labels versus speaker or utterance-content labels. This leaves the type-identification claim without direct evidence on the out-of-domain data where the concern is most acute.

Authors: We agree that quantitative metrics would provide stronger evidence. We will add cluster purity and normalized mutual information (NMI) scores on the out-of-domain embeddings, explicitly comparing alignment with degradation labels against speaker and content labels. This will directly address the type-identification claim for out-of-domain data. revision: yes
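The comparison the authors promise can be sketched with two standard clustering metrics. The code below computes cluster purity and normalized mutual information (NMI, normalized by the arithmetic mean of the two entropies) for one set of cluster assignments against two competing label sets. The toy assignments and the label names are made up for illustration and are not results from the paper.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def entropy(assignments):
    n = len(assignments)
    return -sum((k / n) * math.log(k / n) for k in Counter(assignments).values())


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    pc, py = Counter(clusters), Counter(labels)
    mi = sum(
        (k / n) * math.log(k * n / (pc[c] * py[y])) for (c, y), k in joint.items()
    )
    denom = (entropy(clusters) + entropy(labels)) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    clusters = [0, 0, 1, 1]
    degradation = ["a", "a", "b", "b"]    # hypothetical degradation labels
    speaker = ["s1", "s2", "s1", "s2"]    # hypothetical speaker labels
    print(purity(clusters, degradation), nmi(clusters, degradation))
    print(purity(clusters, speaker), nmi(clusters, speaker))
```

A high score against degradation labels combined with a near-zero score against speaker labels is the pattern that would support the disentanglement claim; similar scores for both label sets would leave the confound unresolved.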

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental outcomes


The paper presents an empirical ML approach: partial mix-up on parallel clean/degraded utterances combined with contrastive loss to produce frame-level embeddings, followed by clustering analysis for degradation detection and type identification. Central claims are validated through reported experiments on in-domain and out-of-domain data rather than any closed-form derivation, self-referential equations, or parameters fitted to a subset then renamed as predictions. No load-bearing steps invoke self-citations for uniqueness theorems, smuggle ansatzes, or reduce the result to the input by construction. The method is self-contained via direct empirical demonstration of embedding clusters and performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. No explicit free parameters, invented entities, or non-standard axioms are described. The work assumes standard deep-learning components and that utterance-level training can yield useful local predictions.

axioms (1)

domain assumption: Suitable model architectures and regularization losses allow SSQA models trained with utterance-level targets to produce useful local predictions.
Stated in the abstract as the starting point for the extension.

pith-pipeline@v0.9.0 · 5676 in / 1241 out tokens · 36261 ms · 2026-05-21T03:24:32.471409+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1] Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

Introduction. Automatic subjective speech quality assessment (SSQA) assigns a quality score to speech signals that reflects subjective quality perception (e.g., poor or excellent) [1]. Usually, these approaches work non-intrusively, i.e., without knowledge of a clean matching reference signal. Furthermore, they are trained with mean opinion score (MOS) su...

[2] Local speech quality assessment. Local subjective speech quality assessment (LSSQA) aims to estimate quality at a finer resolution than the utterance level. Similar to utterance-level assessment, encoder-decoder models [1] can be used to infer frame-level quality scores [2]. Given a dataset D = {(s_1(t), y_1), (s_2(t), y_2), ...} of speech signals s_i(t) wi...

[3] Improved frame-level scores. Local degradation detection is challenging because only utterance-level, i.e., weak, quality targets are available at scale. As motivated at the beginning, we expect that training with frame-level targets will improve detection performance. To this end, we propose a straightforward data augmentation strategy to produce frame-le...

[4] Speech quality embeddings. For downstream applications, not only the location of a degradation in a signal, but also its type is relevant. This can be beneficial in system analysis, e.g., identifying frequent degradation types or retrieving specific degradations from large databases. Cumlin et al. [7] studied the latent space of DNSMOS-like [13] models a...

[5] Local degradation detection: MOS-based versus embedding-based. Given estimated frame-level scores q̂, MOS-based detection of local degradations infers, for each frame, whether it suffers from a quality degradation by comparing its frame-level score to a score threshold q_deg that is tuned on a validation set. Assuming that the embeddings form clusters by degr...

[6] Experiments. To demonstrate that our contributions improve the detection of local degradations and the identification of degradation types, we conduct three evaluations. First, the proposed LSSQA extensions are evaluated with respect to their MOS- and embedding-based degradation detection. Then, the speech quality embeddings are assessed for their specific...

[7] Conclusion. We have shown that the detection of local degradation in speech signals can be significantly improved using (i) frame-level pseudo-targets via a partial mix-up data augmentation, (ii) adding a supervised contrastive loss exploiting knowledge about the degradation types, and (iii) switching from MOS-based to embedding-based detection, resultin...

[8] Acknowledgements. Computational resources were provided by the Paderborn Center for Parallel Computing
[9] W.-C. Huang, E. Cooper, and T. Toda, “SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit,” in Proceedings of ISCA Interspeech, 2025, pp. 2355–2359.

[10] M. Kuhlmann, F. Seebauer, P. Wagner, and R. Haeb-Umbach, “Towards Frame-level Quality Predictions of Synthetic Speech,” in Proceedings of ISCA Interspeech, 2025, pp. 2300–2304.

[11] M. Kuhlmann, A. Werning, T. von Neumann, and R. Haeb-Umbach, “Speech quality-based localization of low-quality speech and text-to-speech synthesis artefacts,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.

[12] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model Based on BLSTM,” in Proceedings of ISCA Interspeech, 2018.

[13] S.-W. Fu, C.-F. Liao, and Y. Tsao, ICASSP, pp. 26–30, 2020.

[14] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in Proceedings of the International Conference on Machine Learning, 2019, pp. 2031–2041.

[15] F. Cumlin, X. Liang, V. Ungureanu, C. K. Reddy, C. Schüldt, and S. Chatterjee, “Impairments are clustered in latents of deep neural network-based speech quality models,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[16] N. Turpault, R. Serizel, A. Shah, and J. Salamon, “Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.

[17] J. Hao, S. Ye, C. Lu, F. Dong, and J. Liu, “DCASE 2022 TASK4 CHALLENGE TECHNICAL REPORT,” Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2022.

[18] M. Chen, Y. Jin, J. Shao, Y. Liu, B. Peng, and J. Chen, “DCASE 2023 Challenge Task4 Technical Report,” Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.

[19] Y. Wu, C. Tsirigotis, K. Chen, C.-Z. A. Huang, A. Courville, O. Nieto, P. Seetharaman, and J. Salamon, “FLAM: Frame-wise language-audio modeling,” in Proceedings of the International Conference on Machine Learning (ICML), 2025. [Online]. Available: https://openreview.net/forum?id=7fQohcFrxG

[20] P. Primus, F. Schmid, and G. Widmer, “TACOS: Temporally-aligned audio captions for language-audio pretraining,” in Proceedings of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025, pp. 1–5.
[21] C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6493–6497.

[22] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2001, pp. 749–752.

[23] M. Chinen, F. S. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “ViSQOL v3: An open source production ready objective speech and audio metric,” in International Conference on Quality of Multimedia Experience (QoMEX), 2020, pp. 1–6.

[24] A. Ragano, J. Skoglund, and A. Hines, “NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1011–1015.

[25] A. Hines and N. Harte, “Speech intelligibility prediction using a neurogram similarity index measure,” Speech Communication, vol. 54, no. 2, pp. 306–320, 2012.

[26] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in Neural Information Processing Systems (NeurIPS), pp. 18661–18673, 2020.

[27] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization Ability of MOS Prediction Networks,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8442–8446, 2022.

[28] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proceedings of ISCA Interspeech, 2021, pp. 2127–2131.

[29] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proceedings of ISCA Interspeech, 2022, pp. 4521–4525.

[30] Y. Leng, X. Tan, S. Zhao, F. K. Soong, X.-Y. Li, and T. Qin, “MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 391–395, 2021.

[31] V. Deng, C. Wang, G. Richard, and B. McFee, “Investigating the sensitivity of pre-trained audio embeddings to common effects,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[32] Y. Tian, L. Fan, P. Isola, H. Chang, and D. Krishnan, “StableRep: Synthetic images from text-to-image models make strong visual representation learners,” Advances in Neural Information Processing Systems (NeurIPS), pp. 48382–48402, 2023.
[33] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.

[34] J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” in Proceedings of ISCA Interspeech, 2020, pp. 2977–2981.

[35] E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” in 11th ISCA Speech Synthesis Workshop, 2021, pp. 183–188.

[36] Z.-H. Tan, N. Dehak et al., “rVAD: An unsupervised segment-based robust voice activity detection method,” Computer Speech & Language, pp. 1–21, 2020.

[37] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[38] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 504–511.

[39] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, “Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training,” in Proceedings of ISCA Interspeech, 2021, pp. 721–725.

[40] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4214–4217.

[41] M. Przybocki and A. F. Martin, “NIST speaker recognition evaluation chronicles,” in Speaker and Language Recognition Workshop (Odyssey), 2004, pp. 15–22.

[42] Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulovic, “A Framework for the Robust Evaluation of Sound Event Detection,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65, 2020.

[43] S. Cornell, J. Ebbers, C. Douwes, I. Martín-Morató, M. Harju, A. Mesaros, and R. Serizel, “DCASE 2024 task 4: Sound event detection with heterogeneous data and missing labels,” arXiv preprint arXiv:2406.08056, 2024.

[44] J. Ebbers, R. Haeb-Umbach, and R. Serizel, “Post-Processing Independent Evaluation of Sound Event Detection Systems,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.

[1] [1]

Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

Introduction Automatic subjective speech quality assessment (SSQA) as- signs a quality score to speech signals that reflects subjective quality perception (e.g., poor or excellent) [1]. Usually, these approaches worknon-intrusively, i.e., without knowledge of a clean matching reference signal. Furthermore, they are trained with mean opinion score (MOS) su...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Similar to utterance-level assessment, encoder-decoder mod- els [1] can be used to infer frame-level quality scores3 [2]

Local speech quality assessment Local subjective speech quality assessment (LSSQA) 2 aims to estimate quality at a finer resolution than the utterance level. Similar to utterance-level assessment, encoder-decoder mod- els [1] can be used to infer frame-level quality scores3 [2]. Given a datasetD={(s 1(t), y1),(s 2(t), y2), . . .} of speech signalss i(t)wi...

work page

[3] [3]

As motivated at the beginning, we expect that training with frame-level targets will improve detection performance

Improved frame-level scores Local degradation detection is challenging because only utterance-level, i.e., weak, quality targets are available at scale. As motivated at the beginning, we expect that training with frame-level targets will improve detection performance. To this end, we propose a straightforward data augmentation strategy to produce frame-le...

work page

[4] [4]

This can be ben- eficial in system analysis, e.g., identifying frequent degradation types or retrieving specific degradations from large databases

Speech quality embeddings For downstream applications, not only the location of a degra- dation in a signal, but also itstypeis relevant. This can be ben- eficial in system analysis, e.g., identifying frequent degradation types or retrieving specific degradations from large databases. Cumlin et al. [7] studied the latent space of DNSMOS-like [13] models a...

work page

[5] [5]

Local degradation detection: MOS-based versus embedding-based Given estimated frame-level scoresˆ q,MOS-baseddetection of local degradations infers, for each frame, whether it suffers from a quality degradation by comparing its frame-level score to a score thresholdq deg that is tuned on a validation set. Assuming that the embeddings form clusters by degr...

Experiments

To demonstrate that our contributions improve the detection of local degradations and the identification of degradation types, we conduct three evaluations. First, the proposed LSSQA extensions are evaluated with respect to their MOS- and embedding-based degradation detection. Then, the speech quality embeddings are assessed for their specific...

Conclusion

We have shown that the detection of local degradation in speech signals can be significantly improved using (i) frame-level pseudo-targets via a partial mix-up data augmentation, (ii) adding a supervised contrastive loss exploiting knowledge about the degradation types, and (iii) switching from MOS-based to embedding-based detection, resultin... The contrastive loss also improved discrimination between degradation types and retrieval of the same types.

Acknowledgements

Computational resources were provided by the Paderborn Center for Parallel Computing.

[1] W.-C. Huang, E. Cooper, and T. Toda, “SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit,” in Proceedings of ISCA Interspeech, 2025, pp. 2355–2359.

[2] M. Kuhlmann, F. Seebauer, P. Wagner, and R. Haeb-Umbach, “Towards Frame-level Quality Predictions of Synthetic Speech,” in Proceedings of ISCA Interspeech, 2025, pp. 2300–2304.

[3] M. Kuhlmann, A. Werning, T. von Neumann, and R. Haeb-Umbach, “Speech quality-based localization of low-quality speech and text-to-speech synthesis artefacts,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.

[4] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model Based on BLSTM,” in Proceedings of ISCA Interspeech, 2018.

[5] S.-W. Fu, C.-F. Liao, and Y. Tsao, icassp, pp. 26–30, 2020.

[6] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in Proceedings of the International Conference on Machine Learning, 2019, pp. 2031–2041.

[7] F. Cumlin, X. Liang, V. Ungureanu, C. K. Reddy, C. Schüldt, and S. Chatterjee, “Impairments are clustered in latents of deep neural network-based speech quality models,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[8] N. Turpault, R. Serizel, A. Shah, and J. Salamon, “Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.

[9] J. Hao, S. Ye, C. Lu, F. Dong, and J. Liu, “DCASE 2022 TASK4 CHALLENGE TECHNICAL REPORT,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2022.

[10] M. Chen, Y. Jin, J. Shao, Y. Liu, B. Peng, and J. Chen, “DCASE 2023 Challenge Task4 Technical Report,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.

[11] Y. Wu, C. Tsirigotis, K. Chen, C.-Z. A. Huang, A. Courville, O. Nieto, P. Seetharaman, and J. Salamon, “FLAM: Frame-wise language-audio modeling,” in Proceedings of the International Conference on Machine Learning (ICML), 2025. [Online]. Available: https://openreview.net/forum?id=7fQohcFrxG

[12] P. Primus, F. Schmid, and G. Widmer, “Tacos: Temporally-aligned audio captions for language-audio pretraining,” in Proceedings of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025, pp. 1–5.

[13] C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6493–6497.

[14] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2001, pp. 749–752.

[15] M. Chinen, F. S. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “Visqol v3: An open source production ready objective speech and audio metric,” in International Conference on Quality of Multimedia Experience (QoMEX), 2020, pp. 1–6.

[16] A. Ragano, J. Skoglund, and A. Hines, “NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1011–1015.

[17] A. Hines and N. Harte, “Speech intelligibility prediction using a neurogram similarity index measure,” Speech Communication, vol. 54, no. 2, pp. 306–320, 2012.

[18] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in Neural Information Processing Systems (NeurIPS), pp. 18 661–18 673, 2020.

[19] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization Ability of MOS Prediction Networks,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8442–8446.

[20] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proceedings of ISCA Interspeech, 2021, pp. 2127–2131.

[21] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proceedings of ISCA Interspeech, 2022, pp. 4521–4525.

[22] Y. Leng, X. Tan, S. Zhao, F. K. Soong, X.-Y. Li, and T. Qin, “MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 391–395.

[23] V. Deng, C. Wang, G. Richard, and B. McFee, “Investigating the sensitivity of pre-trained audio embeddings to common effects,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[24] Y. Tian, L. Fan, P. Isola, H. Chang, and D. Krishnan, “Stablerep: Synthetic images from text-to-image models make strong visual representation learners,” Advances in Neural Information Processing Systems (NeurIPS), pp. 48 382–48 402, 2023.

[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.

[26] J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” in Proceedings of ISCA Interspeech, 2020, pp. 2977–2981.

[27] E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” in 11th ISCA Speech Synthesis Workshop, 2021, pp. 183–188.

[28] Z.-H. Tan, N. Dehak et al., “rVAD: An unsupervised segment-based robust voice activity detection method,” Computer Speech & Language, pp. 1–21, 2020.

[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[30] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proceedings of Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 504–511.

[31] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, “Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training,” in Proceedings of ISCA Interspeech, 2021, pp. 721–725.

[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4214–4217.

[33] M. Przybocki and A. F. Martin, “NIST speaker recognition evaluation chronicles,” in Speaker and Language Recognition Workshop (Odyssey), 2004, pp. 15–22.

[34] Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulovic, “A Framework for the Robust Evaluation of Sound Event Detection,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 61–65.

[35] S. Cornell, J. Ebbers, C. Douwes, I. Martín-Morató, M. Harju, A. Mesaros, and R. Serizel, “DCASE 2024 task 4: Sound event detection with heterogeneous data and missing labels,” arXiv preprint arXiv:2406.08056, 2024.

[36] J. Ebbers, R. Haeb-Umbach, and R. Serizel, “Post-Processing Independent Evaluation of Sound Event Detection Systems,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.