arxiv: 1907.01742 · v1 · pith:CQ3HFEFHnew · submitted 2019-07-03 · 💻 cs.SD · cs.LG · eess.AS

Supervised Classifiers for Audio Impairments with Noisy Labels

Pith reviewed 2026-05-25 10:05 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS
keywords audio impairment classification, noisy labels, convolutional neural networks, VoIP, speech quality, supervised learning, label noise

The pith

CNNs generalize better than dense networks when trained on large volumes of noisy user labels for audio impairment classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether supervised models can classify speech impairments in VoIP calls when the only available labels come from ordinary users and therefore contain substantial noise. It compares a dense network on engineered features with convolutional neural networks on spectrograms and raw waveforms, training on both synthetic random label flips and actual human labeling errors. CNNs maintain higher test accuracy despite the noise, but the required training set size grows in proportion to the noise fraction. This matters because user feedback supplies far more examples than expert annotation can provide, yet label noise makes such data hard to use for training reliable classifiers.

Core claim

Convolutional neural networks can generalize better on training data containing large numbers of noisy labels and thereby achieve higher test performance than dense networks for the task of audio impairment classification. The advantage appears for both randomly injected label noise and noise arising from human errors. Training with noisy labels also demands a substantial increase in dataset size, with the required size scaling proportionally to the fraction of incorrect labels.
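Randomly injected label noise can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes symmetric noise (a flipped label moves to a uniformly chosen different class), and the function name `flip_labels`, the seeds, and the 40% noise fraction are hypothetical. The five classes follow the paper's dataset description (four impairment classes plus one no-impairment class).

```python
import numpy as np

NUM_CLASSES = 5  # 4 impairment classes + 1 no-impairment class


def flip_labels(labels, noise_fraction, num_classes=NUM_CLASSES, seed=0):
    """Return a copy of `labels` in which a fixed fraction is reassigned
    to a different class chosen uniformly at random (assumed noise model)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_fraction * labels.size))
    # Pick which examples receive a wrong label, without repeats.
    flip_idx = rng.choice(labels.size, size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes never maps a
    # label back to itself, so every selected label really changes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy


# Hypothetical clean labels for 1000 clips, then 40% symmetric noise.
clean = np.random.default_rng(1).integers(0, NUM_CLASSES, size=1000)
noisy = flip_labels(clean, noise_fraction=0.4)
observed_noise = float(np.mean(noisy != clean))
```

Human labeling errors are unlikely to be symmetric in this way, since some impairments are confused with each other more often than others, so this sketch only stands in for the synthetic condition.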

What carries the argument

Convolutional neural networks operating on spectrograms or raw audio, which extract local patterns that remain informative even when many training labels are wrong.
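To make the spectrogram input concrete, the sketch below computes a log-power spectrogram with NumPy only. It is an assumption-laden illustration rather than the paper's front end: the Mel filterbank step is omitted, and the frame length, hop size, sample rate, and test tone are hypothetical values chosen for the example.

```python
import numpy as np


def log_power_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Short-time log-power spectrum, shape (n_frames, frame_len // 2 + 1).
    Requires len(signal) >= frame_len."""
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + (signal.size - frame_len) // hop
    # Index matrix: row i selects samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # eps avoids log(0) in silent frames.
    return 10.0 * np.log10(power + eps)


# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_power_spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * sr / 400  # bin spacing is sr / frame_len
```

A 2D CNN would treat the resulting time-by-frequency array as a single-channel image, which is where the local-pattern argument above applies.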

If this is right

  • CNN test accuracy exceeds that of dense networks across both synthetic and human-generated label noise.
  • Dataset size must increase linearly with label-noise fraction to keep performance constant.
  • The comparison pairs each input with one architecture: engineered features with a dense network, Mel spectrograms with a 2D CNN, and raw audio samples with a 1D CNN.
  • Human feedback can serve as usable training labels once the model architecture and data volume are adjusted for the noise level.
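One hedged way to write the linear scaling named in the list above is shown below. The symbols are assumptions introduced for illustration: $N_0$ is the training set size that reaches a target accuracy with clean labels, $\eta \in [0, 1)$ is the fraction of incorrect labels, and $c \ge 0$ is a task-dependent constant that this page does not report.

```latex
N(\eta) \approx N_0 \,\bigl(1 + c\,\eta\bigr)
```

As a purely illustrative calculation, $c = 5$ and $\eta = 0.4$ would give $N \approx 3 N_0$. Both values are invented for the arithmetic and are not results from the paper.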

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large-scale collection of user feedback could become a practical substitute for expert labeling if CNNs continue to tolerate the observed noise rates.
  • The same scaling relationship between noise fraction and required data volume may appear in other audio or multimedia classification tasks that rely on crowd-sourced labels.
  • Real-world deployment would require verifying that the noise distribution in live user feedback matches the distribution studied in the paper.

Load-bearing premise

The observed CNN advantage on noisy training labels will persist when the test set is also labeled by the same noisy human process rather than by clean ground-truth labels.

What would settle it

Collect a test set labeled by the identical human-error process used for training and measure whether CNN accuracy remains higher than dense-network accuracy once both models are trained on equally noisy data.
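The proposed check can be sketched as scoring the same predictions against two reference label sets. Everything in this toy example is hypothetical (the label arrays, the two model outputs, and the names `pred_a` and `pred_b`); it only shows that a ranking measured on clean labels need not match a ranking measured on noisy ones.

```python
import numpy as np


def accuracy(pred, ref):
    """Fraction of predictions that match the reference labels."""
    return float(np.mean(np.asarray(pred) == np.asarray(ref)))


# Hypothetical 10-clip test set: known ground truth vs. human labels
# in which the last two clips were mislabeled as class 0.
clean_test = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
noisy_test = np.array([0, 1, 2, 3, 4, 0, 1, 2, 0, 0])

# Hypothetical outputs: model A matches the ground truth, while
# model B reproduces the human labeling errors.
pred_a = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
pred_b = np.array([0, 1, 2, 3, 4, 0, 1, 2, 0, 0])

scores = {
    name: {"clean": accuracy(p, clean_test), "noisy": accuracy(p, noisy_test)}
    for name, p in (("A", pred_a), ("B", pred_b))
}
```

In this toy case model A wins on clean labels and model B wins on noisy labels, which is why the choice of test labels matters for the premise above.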

Figures

Figures reproduced from arXiv: 1907.01742 by Chandan K A Reddy, Johannes Gehrke, Ross Cutler.

Figure 2: Block diagrams of audio impairment classifiers using (a) engineered features, (b) Mel spectrogram and 2D CNN, and (c) raw audio samples and 1D CNN.
Figure 3: Relationship between training dataset size and noise

Original abstract

Voice-over-Internet-Protocol (VoIP) calls are prone to various speech impairments due to environmental and network conditions resulting in bad user experience. A reliable audio impairment classifier helps to identify the cause for bad audio quality. The user feedback after the call can act as the ground truth labels for training a supervised classifier on a large audio dataset. However, the labels are noisy as most of the users lack the expertise to precisely articulate the impairment in the perceived speech. In this paper, we analyze the effects of massive noise in labels in training dense networks and Convolutional Neural Networks (CNN) using engineered features, spectrograms and raw audio samples as inputs. We demonstrate that CNN can generalize better on the training data with a large number of noisy labels and gives remarkably higher test performance. The classifiers were trained both on randomly generated label noise and the label noise introduced by human errors. We also show that training with noisy labels requires a significant increase in the training dataset size, which is in proportion to the amount of noise in the labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an empirical study on training supervised classifiers (dense networks and CNNs) for detecting speech impairments in VoIP audio using noisy labels from user feedback. It examines the effects of label noise (both synthetic random and human-induced) on generalization, using engineered features, spectrograms, and raw audio as inputs, and concludes that CNNs achieve better generalization and test performance on large noisy datasets, while requiring proportionally larger training sets as noise increases.

Significance. If substantiated by detailed quantitative results, the findings would be significant for real-world audio quality monitoring systems that rely on noisy user feedback. The comparison of multiple input modalities and both synthetic and human label noise provides a practical contribution to handling noisy supervision in audio classification tasks.

major comments (1)
  1. [Abstract] Abstract: the central claim that CNNs 'generalize better' and give 'remarkably higher test performance' is asserted without any quantitative metrics (accuracy, F1, dataset sizes, noise rates, baseline comparisons, or significance tests), making it impossible to assess whether the empirical results support the claim.
minor comments (1)
  1. [Abstract] The statement that dataset size must increase 'in proportion to the amount of noise' should be supported by explicit scaling experiments or a figure showing performance vs. dataset size at different noise levels.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We agree that the abstract would be strengthened by including quantitative metrics to support the claims, and we will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: the central claim that CNNs 'generalize better' and give 'remarkably higher test performance' is asserted without any quantitative metrics (accuracy, F1, dataset sizes, noise rates, baseline comparisons, or significance tests), making it impossible to assess whether the empirical results support the claim.

    Authors: We acknowledge the validity of this observation. The abstract summarizes the findings in qualitative terms, while the full manuscript reports detailed quantitative results (accuracy, F1, dataset sizes, noise rates) comparing dense networks and CNNs across engineered features, spectrograms, and raw audio under both synthetic and human-induced label noise. To address the concern, we will revise the abstract to include key quantitative highlights from the experiments, such as specific performance gains and the scaling of dataset size with noise level.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is purely empirical, reporting experiments on training dense networks and CNNs with engineered features, spectrograms, and raw audio on both synthetic and human-induced noisy labels. No derivations, equations, or parameter-fitting steps are present that could reduce to inputs by construction. Claims rest on direct test performance comparisons against clean ground-truth labels, which are known because the audio is synthesized. That is the appropriate metric for the application and does not rely on self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that user feedback constitutes usable (if noisy) supervision for impairment classification and that the modeled noise types (random and human-error) are representative.

axioms (1)
  • domain assumption User feedback after VoIP calls provides noisy but still informative labels for training impairment classifiers
    Explicitly stated in the abstract as the source of ground truth labels.

pith-pipeline@v0.9.0 · 5714 in / 1111 out tokens · 35036 ms · 2026-05-25T10:05:10.399172+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

  1. [1]

    The speech signal perceived by the human is degraded due to various environmental noises, bad room acoustics and distortions introduced in the communication systems

    Introduction In recent times, the quality of speech transmitted over the Internet is being monitored closely by most of the voice service providers as it correlates highly with the user experience. The speech signal perceived by the human is degraded due to various environmental noises, bad room acoustics and distortions introduced in the communication s...

  2. [2]

    The 4 impairment classes are: i) Background noise, ii) Reverberation, iii) Speech distortion and iv) Low volume

    Audio dataset In this project, we considered 4 impairment classes and 1 no-impairment class. The 4 impairment classes are: i) Background noise, ii) Reverberation, iii) Speech distortion and iv) Low volume. These are some of the top audio impairments that users perceive frequently in Skype VoIP calls. The no-impairment class is composed of clean speech d...

  3. [3]

    Hit application for online subjective evaluation Once the audio dataset is synthesized for different impairments, the next step is to label the clips

    Online evaluation and noisy labels 3.1. Hit application for online subjective evaluation Once the audio dataset is synthesized for different impairments, the next step is to label the clips. Note that the ground truth labels are known since the data is synthesized. Nevertheless, we collect labels from human judges to capture their noise. The synthesized a...

  4. [4]

    Engineered audio features with dense network In the first approach, we extract 18 engineered signal processing features from the audio signal

    Supervised classifiers 4.1. Engineered audio features with dense network In the first approach, we extract 18 engineered signal processing features from the audio signal. The features extracted are spectral centroid, spectral flux, spectral flatness, spectral dynamics, spectral roll-off, zero crossing rate, signal energy, energy entropy, Global SNR and c...

  5. [5]

    Experiments and results 5.1. Baseline evaluation The three supervised classifiers described in Section 4 are trained with clean labels and the evaluation results are used as the baseline to analyze the impact on the accuracy when trained with erroneous labels. The data is divided into 70% for training, 15% for validation and testing each. The engineered ...

  6. [6]

    Conclusion In this paper, we investigated the effects of noisy labels on training an audio impairment classifier using three different input and network architectures. Experimental results suggest that a Log Mel Spectrogram with 2D CNN architecture can be a feasible option to train a supervised audio impairment classifier with noisy labels, provided a su...

  7. [7]

    Recommendation P.800: Methods for subjective determination of transmission quality,

    ITU-T, “Recommendation P.800: Methods for subjective determination of transmission quality,” Feb. 1998

  8. [8]

    ITU-T, “Recommendation P.862: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” Feb. 2001

  9. [9]

    Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - Temporal alignment,

    J. G. Beerends et al., “Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - Temporal alignment,” J. Audio Eng. Soc., vol. 61, no. 6, pp. 366–384, 2013

  10. [10]

    Non-intrusive Speech Quality Assessment Using Neural Networks,

    A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev and J. Gehrke, "Non-intrusive Speech Quality Assessment Using Neural Networks," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 631-635. doi: 10.1109/ICASSP.2019.8683175

  11. [11]

    https://www.mturk.com/

  12. [12]

    Learning with noisy labels,

    N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Advances in neural information processing systems, 2013, pp. 1196–1204

  13. [13]

    Learning Deep Networks from Noisy Labels with Dropout Regularization

    Jindal, Ishan, Matthew S. Nokleby and Xuewen Chen. “Learning Deep Networks from Noisy Labels with Dropout Regularization.” 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016): 967-972

  14. [14]

    Training Convolutional Networks with Noisy Labels

    Sukhbaatar, Sainbayar and Rob Fergus. “Learning from Noisy Labels with Deep Neural Networks.” CoRR abs/1406.2080 (2014): n. pag

  15. [15]

    Learning Sound Event Classifiers from Web Audio with Noisy Labels

    Fonseca, Eduardo, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory and Xavier Serra. “Learning Sound Event Classifiers from Web Audio with Noisy Labels.” CoRR abs/1901.01189 (2019): n. pag

  16. [16]

    General-purpose audio tagging from noisy labels using convolutional neural networks

    Iqbal, Turab, Qiuqiang Kong, Mark. Plumbley and Wenwu Wang. “General-purpose audio tagging from noisy labels using convolutional neural networks.” (2018)

  17. [17]

    A Closer Look at Weak Label Learning for Audio Events

    Shah, Ankit, Anurag Kumar, Alexander G. Hauptmann and Bhiksha Raj. “A Closer Look at Weak Label Learning for Audio Events.” CoRR abs/1804.09288 (2018): n. pag

  18. [18]

    https://docs.microsoft.com/en-us/windows-hardware/drivers/taef/

  19. [19]

    Training deep neural networks on noisy labels with bootstrapping,

    Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in ICLR 2015

  20. [20]

    Joint optimization framework for learning with noisy labels,

    Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa, “Joint optimization framework for learning with noisy labels,” in Proceedings of CVPR, 2018, pp. 5552–5560

  21. [21]

    Generalized cross entropy loss for training deep neural networks with noisy labels,

    Zhilu Zhang and Mert Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018

  22. [22]

    Deep Learning is Robust to Massive Label Noise

    D. Rolnick, A. Veit, S. J. Belongie, and N. Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017

  23. [23]

    Low-Complexity, Nonintrusive Speech Quality Assessment,

    V. Grancharov, D. Y. Zhao, J. Lindblom and W. B. Kleijn, "Low-Complexity, Nonintrusive Speech Quality Assessment," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1948-1956, Nov. 2006. doi: 10.1109/TASL.2006.883250