pith. sign in

arxiv: 2605.21332 · v1 · pith:I5C6SDWQnew · submitted 2026-05-20 · 📡 eess.AS

Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

Pith reviewed 2026-05-21 03:24 UTC · model grok-4.3

classification 📡 eess.AS
keywords speech quality assessmentdegradation detectionframe-level embeddingscontrastive losspartial mix-upspeech signalslocal quality prediction
0
0 comments X

The pith

Frame-level embeddings from speech quality models cluster by degradation type when trained with partial mix-up and contrastive loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern speech systems produce mostly high-quality audio with degradations that appear only in short segments, so utterance-level quality scores miss important local details. This work extends standard speech quality assessment models to output frame-level embeddings instead. It does so by mixing clean and degraded versions of the same utterances in a controlled way and adding a contrastive loss that separates different degradation types in embedding space. Experiments show the resulting embeddings improve detection of where degradations occur and allow the types of degradation to be read off from how the embeddings group together, and this holds for both familiar and new data.

Core claim

Speech quality assessment models trained with a partial mix-up strategy on parallel clean and degraded utterances together with a contrastive loss produce frame-level embeddings that form clusters corresponding to distinct degradation types. These clusters improve degradation detection and enable identification of degradation types by analysis of the embedding space on both in-domain and out-of-domain data.

What carries the argument

Frame-level embeddings produced by a speech quality model after partial mix-up on clean-degraded utterance pairs and contrastive loss to separate degradation types.

If this is right

  • Local frame-level predictions become available for spotting degradations that affect only parts of an utterance.
  • Degradation detection accuracy rises on both data seen during training and new out-of-domain recordings.
  • Specific degradation types can be identified simply by inspecting which cluster an embedding belongs to.
  • The approach extends older utterance-level quality assessment to the localized problems typical of current high-quality systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These embeddings could feed into targeted enhancement algorithms that apply different fixes depending on the detected degradation cluster.
  • Similar mix-up and contrastive training might transfer to non-speech audio such as music or environmental sounds for degradation classification.
  • The clusters could be monitored in real time within communication pipelines to flag emerging quality issues before they affect listeners.

Load-bearing premise

That the partial mix-up and contrastive loss will cause embeddings to group by degradation type rather than by speaker identity or speech content.

What would settle it

If the learned embeddings in experiments cluster primarily by speaker or phonetic content instead of by degradation type, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2605.21332 by Michael Kuhlmann, Reinhold Haeb-Umbach, Tobias Cord-Landwehr.

Figure 1
Figure 1. Figure 1: Block schema of the full proposed model. The encoder Enc feeds two decoder heads: DecMOS for frame-level scores q, followed by mean pooling for utterance-level estimates yˆ, and Dec gscl for frame-level embeddings Z scl, followed by a projection layer for contrastive training. Z MOS denotes the embeddings of the MOS decoder before projection to the frame-level scores. where IP (cb,l) = {(b ′ , l′ ) ∈ {1, .… view at source ↗
Figure 2
Figure 2. Figure 2: CON1 MOS-based (upper) and embedding￾based (lower) detection results for a single utterance from NISQA TEST SIM-partial-mixup. The threshold is tuned to maximize the intersection-based F1-score on the full test set. EER: 4.88% versus 3.87%). 6.3. Embedding analysis To analyze whether frame-level embeddings of the same degra￾dation type group together and are distinguishable from em￾beddings of other degrad… view at source ↗
Figure 3
Figure 3. Figure 3: EER and accuracy when constraining the number of concurrent degradations in NISQA TEST SIM-partial-mixup. (CON1 EER: 7.71%, CON2 EER: 0.93%). Here, it becomes evident that excluding clean frames from the set of positive classes during training (CON2) is clearly beneficial for degra￾dation identification. When increasing the maximum number of simultaneous degradations, accuracy starts to drop and EER gets h… view at source ↗
read the original abstract

Automatic subjective speech quality assessment (SSQA) traditionally estimates speech quality on an utterance or system level. While this resolution was adequate for older transmission or synthesis systems that produced speech signals of mediocre quality, modern systems generate high-quality speech with degradations that may occur only locally. With suitable model architectures and regularization losses, SSQA models trained with utterance-level targets can also yield useful local predictions of speech quality. In this work, we extend such models to produce frame-level embeddings that cluster by degradation type. Specifically, we employ a partial mix-up strategy on a parallel corpus of clean and degraded utterances and apply a contrastive loss to distinguish between degradation types. Through experiments on both in- and out-of-domain data, we demonstrate that our approach improves degradation detection and enables the identification of degradation types by analyzing embedding clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes extending utterance-level speech quality assessment models to produce frame-level embeddings by applying partial mix-up on parallel clean/degraded utterance pairs together with a contrastive loss. The central claim is that the resulting embeddings form clusters corresponding to distinct degradation types, which in turn improves degradation detection and enables type identification. Experiments are reported on both in-domain and out-of-domain data, with supporting visualizations of embedding clusters.

Significance. If the embeddings can be shown to isolate degradation type from speaker and content factors, the approach would meaningfully advance local SSQA for modern high-quality speech systems where degradations are often localized. The combination of partial mix-up and contrastive loss on parallel data is a plausible mechanism, and the in-/out-of-domain evaluation design is appropriate for testing generalization.

major comments (2)
  1. [§3.2] §3.2 (Contrastive Loss and Partial Mix-up): The loss formulation and pairing strategy are not described with explicit controls that hold speaker identity or phonetic content fixed while varying degradation (or vice versa). Without such controls or corresponding ablations, the t-SNE and k-means results cannot rule out the possibility that clusters are driven by speaker or content rather than degradation type. This directly undermines the claim that clusters enable degradation-type identification.
  2. [§4.3] §4.3 (Out-of-domain Experiments): The reported improvements in detection are given, but no quantitative cluster-purity or normalized mutual information metrics are provided that compare alignment with degradation labels versus speaker or utterance-content labels. This leaves the type-identification claim without direct evidence on the out-of-domain data where the concern is most acute.
minor comments (2)
  1. [§1] The abstract and §1 use the term 'partial mix-up' without an immediate reference to its precise definition or the mixing ratio schedule; a short equation or pseudocode would improve clarity.
  2. [Figure 3] Figure 3 (t-SNE visualizations): Color coding by degradation type is helpful, but adding a second panel colored by speaker ID would allow readers to visually assess the disentanglement claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the concerns about explicit controls in the pairing strategy and the lack of quantitative cluster metrics on out-of-domain data. Revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Contrastive Loss and Partial Mix-up): The loss formulation and pairing strategy are not described with explicit controls that hold speaker identity or phonetic content fixed while varying degradation (or vice versa). Without such controls or corresponding ablations, the t-SNE and k-means results cannot rule out the possibility that clusters are driven by speaker or content rather than degradation type. This directly undermines the claim that clusters enable degradation-type identification.

    Authors: The parallel corpus consists of clean and degraded versions of identical utterances, which by design holds speaker identity and phonetic content fixed while varying only the degradation. We will revise §3.2 to explicitly describe this pairing strategy and the resulting controls. We will also add an ablation comparing same-utterance pairs against cross-speaker pairs to confirm that degradation type, rather than speaker or content, drives the observed clustering. revision: yes

  2. Referee: [§4.3] §4.3 (Out-of-domain Experiments): The reported improvements in detection are given, but no quantitative cluster-purity or normalized mutual information metrics are provided that compare alignment with degradation labels versus speaker or utterance-content labels. This leaves the type-identification claim without direct evidence on the out-of-domain data where the concern is most acute.

    Authors: We agree that quantitative metrics would provide stronger evidence. We will add cluster purity and normalized mutual information (NMI) scores on the out-of-domain embeddings, explicitly comparing alignment with degradation labels against speaker and content labels. This will directly address the type-identification claim for out-of-domain data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental outcomes

full rationale

The paper presents an empirical ML approach: partial mix-up on parallel clean/degraded utterances combined with contrastive loss to produce frame-level embeddings, followed by clustering analysis for degradation detection and type identification. Central claims are validated through reported experiments on in-domain and out-of-domain data rather than any closed-form derivation, self-referential equations, or parameters fitted to a subset then renamed as predictions. No load-bearing steps invoke self-citations for uniqueness theorems, smuggle ansatzes, or reduce the result to the input by construction. The method is self-contained via direct empirical demonstration of embedding clusters and performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. No explicit free parameters, invented entities, or non-standard axioms are described. The work assumes standard deep-learning components and that utterance-level training can yield useful local predictions.

axioms (1)
  • domain assumption Suitable model architectures and regularization losses allow utterance-level trained SSQA models to produce useful local predictions.
    Stated in the abstract as the starting point for the extension.

pith-pipeline@v0.9.0 · 5676 in / 1241 out tokens · 36261 ms · 2026-05-21T03:24:32.471409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

    Introduction Automatic subjective speech quality assessment (SSQA) as- signs a quality score to speech signals that reflects subjective quality perception (e.g., poor or excellent) [1]. Usually, these approaches worknon-intrusively, i.e., without knowledge of a clean matching reference signal. Furthermore, they are trained with mean opinion score (MOS) su...

  2. [2]

    Similar to utterance-level assessment, encoder-decoder mod- els [1] can be used to infer frame-level quality scores3 [2]

    Local speech quality assessment Local subjective speech quality assessment (LSSQA) 2 aims to estimate quality at a finer resolution than the utterance level. Similar to utterance-level assessment, encoder-decoder mod- els [1] can be used to infer frame-level quality scores3 [2]. Given a datasetD={(s 1(t), y1),(s 2(t), y2), . . .} of speech signalss i(t)wi...

  3. [3]

    As motivated at the beginning, we expect that training with frame-level targets will improve detection performance

    Improved frame-level scores Local degradation detection is challenging because only utterance-level, i.e., weak, quality targets are available at scale. As motivated at the beginning, we expect that training with frame-level targets will improve detection performance. To this end, we propose a straightforward data augmentation strategy to produce frame-le...

  4. [4]

    This can be ben- eficial in system analysis, e.g., identifying frequent degradation types or retrieving specific degradations from large databases

    Speech quality embeddings For downstream applications, not only the location of a degra- dation in a signal, but also itstypeis relevant. This can be ben- eficial in system analysis, e.g., identifying frequent degradation types or retrieving specific degradations from large databases. Cumlin et al. [7] studied the latent space of DNSMOS-like [13] models a...

  5. [5]

    Local degradation detection: MOS-based versus embedding-based Given estimated frame-level scoresˆ q,MOS-baseddetection of local degradations infers, for each frame, whether it suffers from a quality degradation by comparing its frame-level score to a score thresholdq deg that is tuned on a validation set. Assuming that the embeddings form clusters by degr...

  6. [6]

    First, the proposed LSSQA extensions are evaluated with respect to their MOS- and embedding-based degradation detection

    Experiments To demonstrate that our contributions improve the detection of local degradations and the identification of degradation types, we conduct three evaluations. First, the proposed LSSQA extensions are evaluated with respect to their MOS- and embedding-based degradation detection. Then, the speech quality embeddings are assessed for their specific...

  7. [7]

    The con- trastive loss also improved discrimination between degradation types and retrieval of the same types

    Conclusion We have shown that the detection of local degradation in speech signals can be significantly improved using (i) frame- level pseudo-targets via a partial mix-up data augmentation, (ii) adding a supervised contrastive loss exploiting knowledge about the degradation types, and (iii) switching from MOS- based to embedding-based detection, resultin...

  8. [8]

    Acknowledgements Computational resources were provided by the Paderborn Cen- ter for Parallel Computing

  9. [9]

    SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit,

    W.-C. Huang, E. Cooper, and T. Toda, “SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit,” in Proceedings of ISCA Interspeech, 2025, pp. 2355–2359

  10. [10]

    Towards Frame-level Quality Predictions of Synthetic Speech ,

    M. Kuhlmann, F. Seebauer, P. Wagner, and R. Haeb-Umbach, “Towards Frame-level Quality Predictions of Synthetic Speech ,” inProceedings of ISCA Interspeech, 2025, pp. 2300–2304

  11. [11]

    Speech quality-based localization of low-quality speech and text-to-speech synthesis artefacts,

    M. Kuhlmann, A. Werning, T. von Neumann, and R. Haeb- Umbach, “Speech quality-based localization of low-quality speech and text-to-speech synthesis artefacts,”Proceedings of In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP), 2026

  12. [12]

    Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model Based on BLSTM,

    S. wei Fu, Y . Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model Based on BLSTM,” inProceedings of ISCA Interspeech, 2018

  13. [13]

    Fu, C.-F

    S.-W. Fu, C.-F. Liao, and Y . Tsao,icassp, pp. 26–30, 2020

  14. [14]

    MetricGAN: Gen- erative adversarial networks based black-box metric scores opti- mization for speech enhancement,

    S.-W. Fu, C.-F. Liao, Y . Tsao, and S.-D. Lin, “MetricGAN: Gen- erative adversarial networks based black-box metric scores opti- mization for speech enhancement,” inProceedings of the Interna- tional Conference on Machine Learning, 2019, pp. 2031–2041

  15. [15]

    Impairments are clustered in latents of deep neural network-based speech quality models,

    F. Cumlin, X. Liang, V . Ungureanu, C. K. Reddy, C. Sch¨uldt, and S. Chatterjee, “Impairments are clustered in latents of deep neural network-based speech quality models,” inProceedings of Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  16. [16]

    Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis,

    N. Turpault, R. Serizel, A. Shah, and J. Salamon, “Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis,” inWorkshop on Detection and Clas- sification of Acoustic Scenes and Events (DCASE), 2019

  17. [17]

    DCASE 2022 TASK4 CHALLENGE TECHNICAL REPORT,

    J. Hao, S. Ye, C. Lu, F. Dong, and J. Liu, “DCASE 2022 TASK4 CHALLENGE TECHNICAL REPORT,”Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2022

  18. [18]

    DCASE 2023 Challenge Task4 Technical Report,

    M. Chen, Y . Jin, J. Shao, Y . Liu, B. Peng, and J. Chen, “DCASE 2023 Challenge Task4 Technical Report,”Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023

  19. [19]

    FLAM: Frame-wise language-audio modeling,

    Y . Wu, C. Tsirigotis, K. Chen, C.-Z. A. Huang, A. Courville, O. Nieto, P. Seetharaman, and J. Salamon, “FLAM: Frame-wise language-audio modeling,” inProceedings of the International Conference on Machine Learning (ICML), 2025. [Online]. Available: https://openreview.net/forum?id=7fQohcFrxG

  20. [20]

    Tacos: Temporally- aligned audio captions for language-audio pretraining,

    P. Primus, F. Schmid, and G. Widmer, “Tacos: Temporally- aligned audio captions for language-audio pretraining,” inPro- ceedings of Workshop on Applications of Signal Processing to Au- dio and Acoustics (WASPAA), 2025, pp. 1–5

  21. [21]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,

    C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inProceedings of International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2021, pp. 6493– 6497

  22. [22]

    Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,

    A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” inProceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2001, pp. 749–752

  23. [23]

    Visqol v3: An open source production ready objective speech and audio metric,

    M. Chinen, F. S. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “Visqol v3: An open source production ready objective speech and audio metric,” inInternational Conference on Quality of Multimedia Experience (QoMEX), 2020, pp. 1–6

  24. [24]

    NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment,

    A. Ragano, J. Skoglund, and A. Hines, “NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment,” inPro- ceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1011–1015

  25. [25]

    Speech intelligibility prediction using a neurogram similarity index measure,

    A. Hines and N. Harte, “Speech intelligibility prediction using a neurogram similarity index measure,”Speech Communication, vol. 54, no. 2, pp. 306–320, 2012

  26. [26]

    Supervised contrastive learning,

    P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y . Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,”Advances in Neural Information Processing Systems (NeurIPS), pp. 18 661–18 673, 2020

  27. [27]

    Generaliza- tion Ability of MOS Prediction Networks,

    E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generaliza- tion Ability of MOS Prediction Networks,”Proceedings of Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8442–8446, 2022

  28. [28]

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,

    G. Mittag, B. Naderi, A. Chehadi, and S. M ¨oller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” inProceedings of ISCA Interspeech, 2021, pp. 2127–2131

  29. [29]

    UT- MOS: UTokyo-SaruLab System for V oiceMOS Challenge 2022,

    Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Ko- riyama and Shinnosuke Takamichi and Hiroshi Saruwatari, “UT- MOS: UTokyo-SaruLab System for V oiceMOS Challenge 2022,” inProceedings of ISCA Interspeech, 2022, pp. 4521–4525

  30. [30]

    MB- NET: MOS Prediction for Synthesized Speech with Mean-Bias Network,

    Y . Leng, X. Tan, S. Zhao, F. K. Soong, X.-Y . Li, and T. Qin, “MB- NET: MOS Prediction for Synthesized Speech with Mean-Bias Network,”Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 391–395, 2021

  31. [31]

    Investigating the sensitivity of pre-trained audio embeddings to common effects,

    V . Deng, C. Wang, G. Richard, and B. McFee, “Investigating the sensitivity of pre-trained audio embeddings to common effects,” inProceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  32. [32]

    Stablerep: Synthetic images from text-to-image models make strong visual representation learners,

    Y . Tian, L. Fan, P. Isola, H. Chang, and D. Krishnan, “Stablerep: Synthetic images from text-to-image models make strong visual representation learners,”Advances in Neural Information Pro- cessing Systems (NeurIPS), pp. 48 382–48 402, 2023

  33. [33]

    X-Vectors: Robust DNN Embeddings for Speaker Recog- nition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan- pur, “X-Vectors: Robust DNN Embeddings for Speaker Recog- nition,” inProceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333

  34. [34]

    In defence of metric learning for speaker recognition,

    J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” inProceedings of ISCA Interspeech, 2020, pp. 2977–2981

  35. [35]

    How do voices from past speech synthesis challenges compare today?

    E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” in11th ISCA Speech Syn- thesis Workshop, 2021, pp. 183–188

  36. [36]

    rV AD: An unsupervised segment- based robust voice activity detection method,

    Z.-H. Tan, N. Dehaket al., “rV AD: An unsupervised segment- based robust voice activity detection method,”Computer Speech & Language, pp. 1–21, 2020

  37. [37]

    Lib- rispeech: An ASR corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An ASR corpus based on public domain audio books,” inProceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  38. [38]

    The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,

    J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” inProceedings of Automatic Speech Recogni- tion and Understanding Workshop (ASRU), 2015, pp. 504–511

  39. [39]

    Robust wav2vec 2.0: Analyzing Domain Shift in Self- Supervised Pre-Training,

    W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V . Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, “Robust wav2vec 2.0: Analyzing Domain Shift in Self- Supervised Pre-Training,” inProceedings of ISCA Interspeech, 2021, pp. 721–725

  40. [40]

    A short- time objective intelligibility measure for time-frequency weighted noisy speech,

    C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short- time objective intelligibility measure for time-frequency weighted noisy speech,” inProceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4214–4217

  41. [41]

    NIST speaker recognition eval- uation chronicles,

    M. Przybocki and A. F. Martin, “NIST speaker recognition eval- uation chronicles,” inSpeaker and Language Recognition Work- shop (Odyssey), 2004, pp. 15–22

  42. [42]

    A Framework for the Robust Evaluation of Sound Event Detection,

    C ¸ a˘gdas ¸ Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulovic, “A Framework for the Robust Evaluation of Sound Event Detection,”Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65, 2020

  43. [43]

    DCASE 2024 task 4: Sound event detection with heterogeneous data and missing labels,

    S. Cornell, J. Ebbers, C. Douwes, I. Mart ´ın-Morat´o, M. Harju, A. Mesaros, and R. Serizel, “DCASE 2024 task 4: Sound event detection with heterogeneous data and missing labels,”arXiv preprint arXiv:2406.08056, 2024

  44. [44]

    Post-Processing Independent Evaluation of Sound Event Detection Systems,

    J. Ebbers, R. Haeb-Umbach, and R. Serizel, “Post-Processing Independent Evaluation of Sound Event Detection Systems,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023