Supervised Classifiers for Audio Impairments with Noisy Labels

Chandan K A Reddy; Johannes Gehrke; Ross Cutler

arXiv: 1907.01742 · v1 · pith:CQ3HFEFHnew · submitted 2019-07-03 · 💻 cs.SD · cs.LG · eess.AS

Pith reviewed 2026-05-25 10:05 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS

keywords audio impairment classification, noisy labels, convolutional neural networks, VoIP, speech quality, supervised learning, label noise

The pith

CNNs generalize better than dense networks when trained on large volumes of noisy user labels for audio impairment classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether supervised models can classify speech impairments in VoIP calls when the only available labels come from ordinary users and therefore contain substantial noise. It trains both dense networks and convolutional neural networks on engineered features, spectrograms, and raw waveforms, using both synthetic random label flips and actual human labeling errors. CNNs maintain higher test accuracy despite the noise, but the required training set size grows in direct proportion to the noise fraction. This matters because user feedback supplies far more examples than expert annotation can ever provide, yet the noise level has historically made such data unusable for training reliable classifiers.

Core claim

Convolutional neural networks can generalize better on training data containing large numbers of noisy labels and thereby achieve higher test performance than dense networks for the task of audio impairment classification. The advantage appears for both randomly injected label noise and noise arising from human errors. Training with noisy labels also demands a substantial increase in dataset size, with the required size scaling proportionally to the fraction of incorrect labels.
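The randomly injected label noise mentioned above can be sketched as follows. This is a minimal illustration, assuming symmetric noise in which a fixed fraction of labels is replaced by a different class chosen uniformly at random; the paper's exact noise model is not described here, and `flip_labels` is a hypothetical helper name. The five classes match the four impairment classes plus one no-impairment class listed in the paper's dataset description.

```python
import random

def flip_labels(labels, noise_fraction, num_classes, seed=0):
    """Return a copy of `labels` in which round(noise_fraction * n) entries
    are replaced by a different class chosen uniformly at random.
    Assumption: symmetric (uniform) noise, not the authors' exact procedure."""
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_fraction * len(noisy))
    # Choose the corrupted positions without replacement.
    for i in rng.sample(range(len(noisy)), n_flip):
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

clean = [i % 5 for i in range(1000)]  # 5 classes, evenly distributed
noisy = flip_labels(clean, noise_fraction=0.4, num_classes=5)
observed = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"observed noise fraction: {observed:.2f}")
```

Because every selected label is forced to a different class, the observed noise fraction equals the requested one up to rounding, which makes it easy to sweep noise levels in a controlled way.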

What carries the argument

Convolutional neural networks operating on spectrograms or raw audio, which extract local patterns that remain informative even when many training labels are wrong.
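To make "local patterns" concrete, here is a small sketch of why a convolutional filter followed by pooling responds to a short pattern wherever it occurs in the signal. It is an illustration only: the pattern, the hand-set kernel, and the helper names are hypothetical, and a trained CNN learns its filters rather than having them written by hand.

```python
def conv1d_valid(signal, kernel):
    """Cross-correlation without padding (the operation deep-learning
    'convolution' layers compute)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def global_max_pool(feature_map):
    """Keep only the strongest filter response, discarding its position."""
    return max(feature_map)

def embed(pattern, length, offset):
    """Place `pattern` into an otherwise silent signal at `offset`."""
    x = [0.0] * length
    x[offset:offset + len(pattern)] = pattern
    return x

pattern = [1.0, -1.0, 1.0]  # hypothetical short artifact
kernel = pattern            # filter matched to that artifact

early = conv1d_valid(embed(pattern, 32, 2), kernel)
late = conv1d_valid(embed(pattern, 32, 20), kernel)

# The pooled response is the same wherever the artifact sits.
print(global_max_pool(early), global_max_pool(late))
# The feature map itself still records where the artifact occurred.
print(early.index(max(early)), late.index(max(late)))
```

A dense layer has no such weight sharing: it would need separate weights for every position at which the pattern might appear.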

If this is right

CNN test accuracy exceeds that of dense networks across both synthetic and human-generated label noise.
Dataset size must increase linearly with label-noise fraction to keep performance constant.
The comparison covers three input pipelines: engineered features with a dense network, Mel spectrograms with a 2D CNN, and raw waveforms with a 1D CNN.
Human feedback can serve as usable training labels once the model architecture and data volume are adjusted for the noise level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large-scale collection of user feedback could become a practical substitute for expert labeling if CNNs continue to tolerate the observed noise rates.
The same scaling relationship between noise fraction and required data volume may appear in other audio or multimedia classification tasks that rely on crowd-sourced labels.
Real-world deployment would require verifying that the noise distribution in live user feedback matches the distribution studied in the paper.

Load-bearing premise

The observed CNN advantage on noisy training labels will persist when the test set is also labeled by the same noisy human process rather than with clean ground-truth labels.

What would settle it

Collect a test set labeled by the identical human-error process used for training and measure whether CNN accuracy remains higher than dense-network accuracy once both models are trained on equally noisy data.
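The proposed check can be sketched as scoring the same predictions against two reference label sets, one clean and one noisy. Everything below is simulated: the two "models" are hypothetical stand-ins produced by corrupting the clean labels at made-up error rates (10% and 25%), and the 30% test-label noise is an arbitrary choice, not a figure from the paper.

```python
import random

def accuracy(predictions, reference):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(reference):
        raise ValueError("length mismatch")
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)

def corrupt(labels, fraction, num_classes, rng):
    """Replace exactly round(fraction * n) labels with a different class."""
    out = list(labels)
    for i in rng.sample(range(len(out)), round(fraction * len(out))):
        out[i] = rng.choice([c for c in range(num_classes) if c != out[i]])
    return out

rng = random.Random(0)
num_classes = 5
clean_test = [rng.randrange(num_classes) for _ in range(2000)]
# Stand-in for a test set labeled by the noisy human process.
noisy_test = corrupt(clean_test, 0.30, num_classes, rng)

# Hypothetical model outputs with fixed error rates against the clean labels.
models = {
    "cnn_like": corrupt(clean_test, 0.10, num_classes, rng),
    "dense_like": corrupt(clean_test, 0.25, num_classes, rng),
}

for name, preds in models.items():
    print(name,
          "clean-ref:", round(accuracy(preds, clean_test), 3),
          "noisy-ref:", round(accuracy(preds, noisy_test), 3))
```

Scores against the noisy reference are compressed toward the agreement rate between the two label sets, so a gap that is clear on clean labels can shrink when measured on noisy ones. Whether the ranking survives on real human labels is the empirical question this section raises.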

Figures

Figures reproduced from arXiv: 1907.01742 by Chandan K A Reddy, Johannes Gehrke, Ross Cutler.

**Figure 2.** Block diagrams of audio impairment classifiers using (a) engineered features, (b) Mel spectrogram and 2D CNN, and (c) raw audio samples and 1D CNN.

**Figure 3.** Relationship between training dataset size and noise.
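Branch (a) of Figure 2 feeds engineered features to a dense network. The paper's feature list includes zero crossing rate, signal energy, and spectral centroid, and the sketch below computes those three for one frame using common textbook definitions. These definitions, the frame length, and the 8 kHz sample rate are assumptions for illustration and may differ from the authors' implementation.

```python
import cmath
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)

def signal_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive O(n^2) DFT
    over the non-negative frequency bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        mags.append(abs(coeff))
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, return 0 by convention
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

sample_rate = 8000  # Hz, illustrative
n = 256
# A 1 kHz test tone that falls exactly on a DFT bin.
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]

features = {
    "zero_crossing_rate": zero_crossing_rate(tone),
    "signal_energy": signal_energy(tone),
    "spectral_centroid_hz": spectral_centroid(tone, sample_rate),
}
print(features)
```

For a pure tone the centroid sits at the tone frequency and the energy is half the squared amplitude, which gives a quick sanity check before applying the functions to real speech frames.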

Original abstract

Voice-over-Internet-Protocol (VoIP) calls are prone to various speech impairments due to environmental and network conditions resulting in bad user experience. A reliable audio impairment classifier helps to identify the cause for bad audio quality. The user feedback after the call can act as the ground truth labels for training a supervised classifier on a large audio dataset. However, the labels are noisy as most of the users lack the expertise to precisely articulate the impairment in the perceived speech. In this paper, we analyze the effects of massive noise in labels in training dense networks and Convolutional Neural Networks (CNN) using engineered features, spectrograms and raw audio samples as inputs. We demonstrate that CNN can generalize better on the training data with a large number of noisy labels and gives remarkably higher test performance. The classifiers were trained both on randomly generated label noise and the label noise introduced by human errors. We also show that training with noisy labels requires a significant increase in the training dataset size, which is in proportion to the amount of noise in the labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CNNs handle noisy user labels better than dense nets for VoIP impairment classification, but the abstract supplies no numbers to size the effect.

The core finding is that CNNs trained on large noisy label sets from user feedback outperform dense networks on clean test data for classifying VoIP audio impairments. The paper tests this with both synthetic random noise and actual human labeling errors, using engineered features, spectrograms, and raw audio as inputs. It also notes that noisier labels require proportionally larger training sets to maintain performance. That matches the practical need in communication systems where only user ratings are available at scale. The comparison across input types and noise sources is a straightforward empirical contribution that could matter for anyone building audio quality monitors.

The stress-test note is right that evaluating against clean ground-truth labels is the right metric here, since the goal is to recover true impairment types despite training noise. No internal contradiction shows up in the stated claims.

The main limitation is the complete absence of quantitative results in the abstract: no accuracies, no dataset sizes, no noise rates, no baseline numbers, and no significance tests. Without those, the size of the reported gain and whether it justifies the extra data cost remain unclear.

The work is narrow in scope, focused on this one task rather than new theory or broad methods. It is aimed at practitioners in audio signal processing and real-time communication systems who already deal with user-generated labels. A reader in that niche could extract useful guidance on dataset scaling, but the lack of reported metrics limits how far the results can be taken without the full paper. I would send it for peer review so the experiments can be checked in detail.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an empirical study on training supervised classifiers (dense networks and CNNs) for detecting speech impairments in VoIP audio using noisy labels from user feedback. It examines the effects of label noise (both synthetic random and human-induced) on generalization, using engineered features, spectrograms, and raw audio as inputs, and concludes that CNNs achieve better generalization and test performance on large noisy datasets, while requiring proportionally larger training sets as noise increases.

Significance. If substantiated by detailed quantitative results, the findings would be significant for real-world audio quality monitoring systems that rely on noisy user feedback. The comparison of multiple input modalities and both synthetic and human label noise provides a practical contribution to handling noisy supervision in audio classification tasks.

major comments (1)

[Abstract] The central claim that CNNs 'generalize better' and give 'remarkably higher test performance' is asserted without any quantitative metrics (accuracy, F1, dataset sizes, noise rates, baseline comparisons, or significance tests), making it impossible to assess whether the empirical results support the claim.

minor comments (1)

[Abstract] The statement that dataset size must increase 'in proportion to the amount of noise' should be supported by explicit scaling experiments or a figure showing performance vs. dataset size at different noise levels.
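The layout of such a scaling experiment can be sketched with a toy problem: sweep label-noise levels and training-set sizes, train on noisy labels, and score against clean test labels. This is not the paper's setup. The two-class Gaussian data, the nearest-centroid classifier, and all sizes and noise levels below are assumptions chosen so the sketch runs in pure Python.

```python
import random

def make_data(n, rng):
    """Two 1-D Gaussian classes with unit variance, centred at -1 and +1."""
    labels = [rng.randrange(2) for _ in range(n)]
    xs = [rng.gauss(2 * y - 1, 1.0) for y in labels]
    return xs, labels

def flip(labels, fraction, rng):
    """Flip exactly round(fraction * n) binary labels, chosen at random."""
    out = list(labels)
    for i in rng.sample(range(len(out)), round(fraction * len(out))):
        out[i] = 1 - out[i]
    return out

def class_mean(xs, labels, target):
    vals = [x for x, y in zip(xs, labels) if y == target]
    return sum(vals) / len(vals) if vals else 0.0  # fallback for an empty class

def fit(xs, labels):
    """Nearest-centroid rule in 1-D: a threshold plus which side is class 1."""
    m0, m1 = class_mean(xs, labels, 0), class_mean(xs, labels, 1)
    return (m0 + m1) / 2, m1 >= m0

def clean_accuracy(model, xs, labels):
    threshold, one_is_upper = model
    preds = [int((x > threshold) == one_is_upper) for x in xs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def sweep(noise_levels, sizes, trials=30, seed=0):
    """Mean clean-test accuracy for every (noise level, training size) pair."""
    rng = random.Random(seed)
    test_x, test_y = make_data(2000, rng)  # test labels stay clean
    table = {}
    for noise in noise_levels:
        for n in sizes:
            accs = []
            for _ in range(trials):
                xs, ys = make_data(n, rng)
                model = fit(xs, flip(ys, noise, rng))  # train on noisy labels
                accs.append(clean_accuracy(model, test_x, test_y))
            table[(noise, n)] = sum(accs) / len(accs)
    return table

results = sweep(noise_levels=[0.0, 0.2, 0.4], sizes=[20, 200, 2000])
for (noise, n), acc in sorted(results.items()):
    print(f"noise={noise:.1f}  n={n:4d}  clean-test accuracy={acc:.3f}")
```

Reading the printed table along a row of fixed noise shows how much data is needed to recover clean-label accuracy at that noise level, which is the kind of evidence the comment asks for. The toy result says nothing about the actual scaling law in the paper.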

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We agree that the abstract would be strengthened by including quantitative metrics to support the claims, and we will revise the manuscript accordingly.

Referee: [Abstract] The central claim that CNNs 'generalize better' and give 'remarkably higher test performance' is asserted without any quantitative metrics (accuracy, F1, dataset sizes, noise rates, baseline comparisons, or significance tests), making it impossible to assess whether the empirical results support the claim.

Authors: We acknowledge the validity of this observation. The abstract summarizes the findings in qualitative terms, while the full manuscript reports detailed quantitative results (accuracy, F1, dataset sizes, noise rates) comparing dense networks and CNNs across engineered features, spectrograms, and raw audio under both synthetic and human-induced label noise. To address the concern, we will revise the abstract to include key quantitative highlights from the experiments, such as specific performance gains and the scaling of dataset size with noise level.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is purely empirical, reporting experiments on training dense networks and CNNs with engineered features, spectrograms, and raw audio on both synthetic and human-induced noisy labels. No derivations, equations, or parameter-fitting steps are present that could reduce to inputs by construction. Claims rest on direct test performance comparisons against clean ground-truth labels, which is the appropriate metric for the application and does not rely on self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that user feedback constitutes usable (if noisy) supervision for impairment classification and that the modeled noise types (random and human-error) are representative.

axioms (1)

domain assumption: User feedback after VoIP calls provides noisy but still informative labels for training impairment classifiers.
Explicitly stated in the abstract as the source of ground truth labels.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1]

The speech signal perceived by the human is degraded due to various environmental noises, bad room acoustics and distortions introduced in the communication systems

Introduction. In recent times, the quality of speech transmitted over the Internet is being monitored closely by most of the voice service providers as it correlates highly with the user experience. The speech signal perceived by the human is degraded due to various environmental noises, bad room acoustics and distortions introduced in the communication s...

[2]

The 4 impairment classes are: i) Background noise, ii) Reverberation, iii) Speech distortion and iv) Low volume

Audio dataset. In this project, we considered 4 impairment classes and 1 no-impairment class. The 4 impairment classes are: i) Background noise, ii) Reverberation, iii) Speech distortion and iv) Low volume. These are some of the top audio impairments that users perceive frequently in Skype VoIP calls. The no-impairment class is composed of clean speech d...

[3]

Hit application for online subjective evaluation. Once the audio dataset is synthesized for different impairments, the next step is to label the clips

Online evaluation and noisy labels. 3.1. Hit application for online subjective evaluation. Once the audio dataset is synthesized for different impairments, the next step is to label the clips. Note that the ground truth labels are known since the data is synthesized. Nevertheless, we collect labels from human judges to capture their noise. The synthesized a...

[4]

Engineered audio features with dense network. In the first approach, we extract 18 engineered signal processing features from the audio signal

Supervised classifiers. 4.1. Engineered audio features with dense network. In the first approach, we extract 18 engineered signal processing features from the audio signal. The features extracted are spectral centroid, spectral flux, spectral flatness, spectral dynamics, spectral roll-off, zero crossing rate, signal energy, energy entropy, Global SNR and c...

[5]

Experiments and results. 5.1. Baseline evaluation. The three supervised classifiers described in Section 4 are trained with clean labels and the evaluation results are used as the baseline to analyze the impact on the accuracy when trained with erroneous labels. The data is divided into 70% for training, 15% for validation and testing each. The engineered ...

[6]

Conclusion. In this paper, we investigated the effects of noisy labels on training an audio impairment classifier using three different input and network architectures. Experimental results suggest that a Log Mel Spectrogram with 2D CNN architecture can be a feasible option to train a supervised audio impairment classifier with noisy labels, provided a su...

[7]

Recommendation P.800: Methods for subjective determination of transmission quality

ITU-T, “Recommendation P.800: Methods for subjective determination of transmission quality,” Feb. 1998

[8]

ITU-T, “Recommendation P.862: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” Feb. 2001

[9]

Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I – Temporal alignment

J. G. Beerends et al., “Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I – Temporal alignment,” J. Audio Eng. Soc., vol. 61, no. 6, pp. 366–384, 2013

[10]

Non-intrusive Speech Quality Assessment Using Neural Networks

A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev and J. Gehrke, "Non-intrusive Speech Quality Assessment Using Neural Networks," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 631-635. doi: 10.1109/ICASSP.2019.8683175

[11]

https://www.mturk.com/

[12]

Learning with noisy labels

N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Advances in neural information processing systems, 2013, pp. 1196–1204

[13]

Learning Deep Networks from Noisy Labels with Dropout Regularization

Jindal, Ishan, Matthew S. Nokleby and Xuewen Chen. “Learning Deep Networks from Noisy Labels with Dropout Regularization.” 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016): 967-972

[14]

Training Convolutional Networks with Noisy Labels

Sukhbaatar, Sainbayar and Rob Fergus. “Learning from Noisy Labels with Deep Neural Networks.” CoRR abs/1406.2080 (2014): n. pag

[15]

Learning Sound Event Classifiers from Web Audio with Noisy Labels

Fonseca, Eduardo, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory and Xavier Serra. “Learning Sound Event Classifiers from Web Audio with Noisy Labels.” CoRR abs/1901.01189 (2019): n. pag

[16]

General-purpose audio tagging from noisy labels using convolutional neural networks

Iqbal, Turab, Qiuqiang Kong, Mark D. Plumbley and Wenwu Wang. “General-purpose audio tagging from noisy labels using convolutional neural networks.” (2018)

[17]

A Closer Look at Weak Label Learning for Audio Events

Shah, Ankit, Anurag Kumar, Alexander G. Hauptmann and Bhiksha Raj. “A Closer Look at Weak Label Learning for Audio Events.” CoRR abs/1804.09288 (2018): n. pag

[18]

https://docs.microsoft.com/en-us/windows-hardware/drivers/taef/

[19]

Training deep neural networks on noisy labels with bootstrapping

Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in ICLR 2015

[20]

Joint optimization framework for learning with noisy labels

Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa, “Joint optimization framework for learning with noisy labels,” in Proceedings of CVPR, 2018, pp. 5552–5560

[21]

Generalized cross entropy loss for training deep neural networks with noisy labels

Zhilu Zhang and Mert Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018

[22]

Deep Learning is Robust to Massive Label Noise

D. Rolnick, A. Veit, S. J. Belongie, and N. Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017

[23]

Low-Complexity, Nonintrusive Speech Quality Assessment

V. Grancharov, D. Y. Zhao, J. Lindblom and W. B. Kleijn, "Low-Complexity, Nonintrusive Speech Quality Assessment," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1948-1956, Nov. 2006. doi: 10.1109/TASL.2006.883250
