
arxiv: 2211.12080 · v2 · submitted 2022-11-22 · 💻 cs.SD · eess.AS

Robust Training for Speaker Verification against Noisy Labels

Pith reviewed 2026-05-24 10:37 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords speaker verification · noisy labels · robust training · OR-Gate top-k · label filtering · VoxCeleb · two-stage learning · deep neural networks

The pith

A two-stage training process with an OR-Gate top-k mechanism filters noisy labels during speaker verification model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that models for speaker verification can be trained reliably even when many dataset labels are incorrect. It begins with a short initial training pass over the full dataset because networks tend to fit clean examples before noisy ones. The method then applies an OR-Gate top-k rule that keeps only those samples where the current model prediction agrees with the given label and retrains on the selected subset, repeating until training ends. A reader would care because speaker datasets grow large quickly yet labeling errors are common and expensive to fix by hand. If the claim holds, practitioners can use bigger imperfect collections without separate cleaning steps.

Core claim

The authors establish that their two-stage learning method, which first trains on all data for several epochs and then iteratively uses an OR-Gate top-k comparison of model predictions to given labels to retain clean samples, removes the effect of noisy labels and produces strong verification performance on VoxCeleb1 and VoxCeleb2 at multiple added noise rates.
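The schedule described above can be sketched as a short loop. This is a minimal illustration, not the authors' code: `two_stage_training`, `train_one_epoch`, and `label_in_top_k` are hypothetical names, and the sketch assumes that Stage II re-selects from the full dataset before every epoch, so a sample dropped earlier can return later.

```python
from typing import Callable, List


def two_stage_training(
    num_samples: int,
    total_epochs: int,
    warmup_epochs: int,
    train_one_epoch: Callable[[List[int]], None],
    label_in_top_k: Callable[[int], bool],
) -> List[List[int]]:
    """Run the schedule and return the sample indices used at each epoch."""
    all_indices = list(range(num_samples))
    selected = all_indices
    used_per_epoch = []
    for epoch in range(total_epochs):
        if epoch >= warmup_epochs:
            # Stage II (assumed form): keep samples whose given label agrees
            # with the current model's prediction, drawn from ALL data.
            selected = [t for t in all_indices if label_in_top_k(t)]
        # Stage I trains on everything; Stage II trains on the filtered set.
        train_one_epoch(selected)
        used_per_epoch.append(list(selected))
    return used_per_epoch


# Toy run: samples 1 and 3 stand in for noisy labels the model rejects.
noisy_ids = {1, 3}
trained_on = []
used = two_stage_training(
    num_samples=5,
    total_epochs=4,
    warmup_epochs=2,
    train_one_epoch=trained_on.append,
    label_in_top_k=lambda t: t not in noisy_ids,
)
```

In the toy run the first two epochs use all five samples and the last two use only samples 0, 2, and 4. A real implementation would replace the two callbacks with an actual training step and a prediction check.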

What carries the argument

The OR-Gate with top-k mechanism, which treats a supplied label as clean when it appears among the model's top-k predicted labels for that sample and retains only those samples for the next round of training.
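A small sketch may make the selection rule concrete. It assumes, based on the figure captions rather than the full paper, that the gate ORs the indicator "given label is in the top-k set" over the epochs whose predictions were stored. The function names and the toy scores are illustrative only.

```python
from typing import List, Sequence, Set


def top_k_labels(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest scores, i.e. the top-k predicted labels."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return set(ranked[:k])


def or_gate_is_clean(label: int, epoch_scores: Sequence[Sequence[float]], k: int) -> bool:
    """OR over stored epochs of the indicator I(label in top-k set)."""
    return any(label in top_k_labels(scores, k) for scores in epoch_scores)


def select_clean(labels: Sequence[int], history: Sequence[Sequence[Sequence[float]]], k: int) -> List[int]:
    """history[t] holds one class-score vector per stored epoch for sample t."""
    return [t for t, y in enumerate(labels) if or_gate_is_clean(y, history[t], k)]


# Three samples, three speakers, two stored epochs each (made-up scores).
labels = [0, 2, 1]
history = [
    [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],  # label 0 ranks first in both epochs
    [[0.80, 0.15, 0.05], [0.70, 0.20, 0.10]],  # label 2 ranks last in both epochs
    [[0.50, 0.10, 0.40], [0.25, 0.45, 0.30]],  # label 1 enters the top-k only at epoch 2
]
kept = select_clean(labels, history, k=2)
```

Here `kept` contains samples 0 and 2. Sample 2 shows the effect of the OR: a single epoch in which the label reaches the top-k set is enough to keep it.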

If this is right

  • Verification accuracy on VoxCeleb1 and VoxCeleb2 stays high across different added noise rates.
  • The iterative selection finishes training without any external label correction tool.
  • Standard training on the same noisy data yields lower performance than the filtered version.
  • The approach works with existing DNN architectures for speaker embedding without architectural changes.
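The "added noise rates" in the list above refer to labels corrupted on purpose at a chosen rate η. The description of the exact procedure is cut off in the text available here, so the sketch below shows one common choice, symmetric noise, as an assumption: a fraction η of samples is picked at random and each is reassigned to a different speaker drawn uniformly. `inject_label_noise` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def inject_label_noise(
    labels: List[int], num_classes: int, noise_rate: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (noisy_labels, flipped_indices) under symmetric label noise."""
    rng = random.Random(seed)
    n_noisy = round(noise_rate * len(labels))
    flipped = sorted(rng.sample(range(len(labels)), n_noisy))
    noisy = list(labels)
    for t in flipped:
        # Draw from the other num_classes - 1 labels, skipping the true one.
        new_label = rng.randrange(num_classes - 1)
        if new_label >= labels[t]:
            new_label += 1
        noisy[t] = new_label
    return noisy, flipped


clean = [i % 10 for i in range(200)]  # 200 toy utterances, 10 speakers
noisy, flipped = inject_label_noise(clean, num_classes=10, noise_rate=0.2)
```

Because every flipped label is guaranteed to differ from the original, the realised noise rate equals the requested one exactly, which makes clean-versus-noisy comparisons easier to interpret.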

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-training agreement signal could be tested on other audio classification tasks that suffer label noise.
  • Replacing the top-k rule with a learned threshold might reduce the need to tune the selection parameter.
  • Real-world label errors, rather than artificially injected noise, would provide a stronger test of the method's practicality.

Load-bearing premise

Deep neural networks fit clean-labeled examples before noisy ones during the first few training epochs.

What would settle it

An experiment showing that prediction-label agreement rates do not separate clean from noisy examples after the initial epochs would undermine the selection step.
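With synthetic noise the true labels are known, so this check is cheap to run. The sketch below is one hypothetical way to compute it: `agreement_rates` reports, separately for clean and noisy samples, how often the given label falls inside the model's top-k set. The inputs are made-up values, not results from the paper.

```python
from typing import Dict, Sequence, Set


def agreement_rates(
    given_labels: Sequence[int],
    true_labels: Sequence[int],
    top_k_sets: Sequence[Set[int]],
) -> Dict[str, float]:
    """Fraction of clean and of noisy samples whose given label is in the top-k set."""
    hits = {"clean": 0, "noisy": 0}
    counts = {"clean": 0, "noisy": 0}
    for given, true, top_k in zip(given_labels, true_labels, top_k_sets):
        group = "clean" if given == true else "noisy"
        counts[group] += 1
        hits[group] += given in top_k
    # NaN marks a group with no samples instead of raising on division by zero.
    return {g: (hits[g] / counts[g] if counts[g] else float("nan")) for g in counts}


given = [0, 1, 2, 3]
true = [0, 1, 0, 1]           # samples 2 and 3 carry corrupted labels
top_k = [{0}, {1, 2}, {0}, {3}]
rates = agreement_rates(given, true, top_k)
```

In this toy case the clean rate is 1.0 and the noisy rate is 0.5. If the two rates were close after the initial epochs on real data, the selection step would have little to work with.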

Figures

Figures reproduced from arXiv: 2211.12080 by Hanhan Ma, Liang He, Lin Li, Xiaochen Guo, Zhihua Fang.

Figure 1: The framework of the proposed two-stage learning. Stage I: Train the network with all data for a few epochs, and store the model’s predictions for each sample at all epochs into the prediction set P. Stage II: Matching all data with the prediction set P, the data is divided into data with correct labels and data with noisy labels using the OR-Gate (see …)
Figure 2: The OR-Gate mechanism. The picture represents the decision process of the t-th sample at the i-th epoch. I is the indicator function and P_t^i represents the set of labels for the top-k prediction of the sample x_t at the i-th epoch model. When output_t is 1, the label y_t is correct; otherwise, it is noisy. […] considered successful as long as the label y_i exists in the label set P_i corresponding to the top-k p…
Original abstract

The deep learning models used for speaker verification rely heavily on large amounts of data and correct labeling. However, noisy (incorrect) labels often occur, which degrades the performance of the system. In this paper, we propose a novel two-stage learning method to filter out noisy labels from speaker datasets. Since a DNN will first fit data with clean labels, we first train the model with all data for several epochs. Then, based on this model, the model predictions are compared with the labels using our proposed OR-Gate with top-k mechanism to select the data with clean labels, and the selected data is used to train the model. This process is iterated until the training is completed. We have demonstrated the effectiveness of this method in filtering noisy labels through extensive experiments and have achieved excellent performance on VoxCeleb (1 and 2) with different added noise rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a novel two-stage iterative training method for speaker verification to mitigate noisy labels. The approach first trains a DNN on the entire (potentially noisy) dataset for several epochs, then applies an OR-Gate top-k selection mechanism that compares model predictions against given labels to retain presumed clean samples, retrains on the filtered set, and repeats the process until training completes. The authors report that this filters noisy labels effectively and achieves excellent performance on VoxCeleb1 and VoxCeleb2 under varying rates of added synthetic noise.

Significance. If the central claim holds, the method offers a relatively simple, assumption-driven procedure for robust training on imperfect speaker datasets, which is practically relevant given the scale and labeling challenges of corpora like VoxCeleb. The controlled noise-injection experiments on standard benchmarks provide a concrete testbed, though the significance hinges on whether the early-learning separation generalizes to the speaker-embedding setting.

major comments (2)
  1. [§3 (Proposed Method)] The procedure is motivated by the statement that 'a DNN will first fit data with clean labels,' which is used to justify the initial training epochs before OR-Gate top-k selection. No domain-specific validation is supplied (e.g., per-epoch loss or accuracy curves on explicitly partitioned clean vs. noisy subsets of VoxCeleb) for the chosen architecture, loss, or embedding extractor. This assumption is load-bearing: if the separation does not occur at the selected epoch, the subsequent filtering retains noisy samples and the iterative claim fails.
  2. [§4 (Experiments)] The manuscript claims 'excellent performance' across noise rates but does not report direct head-to-head comparisons against established noisy-label baselines (e.g., Co-teaching or standard label smoothing) using identical noise schedules, model backbone, and evaluation protocol on VoxCeleb. Without these controls it is impossible to isolate the contribution of the OR-Gate top-k mechanism from other factors.
minor comments (2)
  1. The abstract states the performance claim without any numerical results (EER, minDCF, etc.); including at least the key metrics for the highest noise rate would strengthen the summary.
  2. [§3] A formal definition or pseudocode box for the OR-Gate top-k operation would clarify the exact selection rule and its hyperparameters (k, number of initial epochs).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [§3 (Proposed Method)] The procedure is motivated by the statement that 'a DNN will first fit data with clean labels,' which is used to justify the initial training epochs before OR-Gate top-k selection. No domain-specific validation is supplied (e.g., per-epoch loss or accuracy curves on explicitly partitioned clean vs. noisy subsets of VoxCeleb) for the chosen architecture, loss, or embedding extractor. This assumption is load-bearing: if the separation does not occur at the selected epoch, the subsequent filtering retains noisy samples and the iterative claim fails.

    Authors: We agree that the early-learning assumption would be strengthened by speaker-verification-specific evidence. While the assumption draws from established noisy-label literature and our empirical gains under controlled noise are consistent with it holding, we will add per-epoch loss and accuracy curves on explicitly partitioned clean/noisy subsets of VoxCeleb in the revised manuscript to validate the separation timing for our architecture and loss. revision: yes

  2. Referee: [§4 (Experiments)] The manuscript claims 'excellent performance' across noise rates but does not report direct head-to-head comparisons against established noisy-label baselines (e.g., Co-teaching or standard label smoothing) using identical noise schedules, model backbone, and evaluation protocol on VoxCeleb. Without these controls it is impossible to isolate the contribution of the OR-Gate top-k mechanism from other factors.

    Authors: We concur that head-to-head comparisons are necessary to isolate the OR-Gate top-k contribution. In the revision we will include experiments against Co-teaching and label smoothing under identical noise-injection schedules, model backbones, and VoxCeleb evaluation protocols, allowing direct assessment of our method relative to these baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical iterative filter with external assumption

full rationale

The paper describes a two-stage iterative procedure: train on all data for initial epochs (invoking the general early-learning property of DNNs), then apply OR-Gate top-k selection on predictions vs. labels to retain presumed-clean samples, and repeat. This selection heuristic and iteration are procedural choices, not a mathematical derivation whose output equals its inputs by construction. The early-learning premise is cited from prior literature on noisy-label learning rather than self-citation or internal definition. No equations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided abstract or described method. Performance claims rest on experiments with added noise on VoxCeleb, which are externally falsifiable and not forced by the procedure itself. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that DNNs learn clean labels first; no free parameters, additional axioms, or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption A DNN will first fit data with clean labels
    Used to justify the initial training phase before selection begins.

pith-pipeline@v0.9.0 · 5682 in / 1180 out tokens · 50868 ms · 2026-05-24T10:37:44.395361+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    The success of these models depends on large-scale labelled datasets [8, 9]

    Introduction In recent years, speaker models based on deep neural networks (e.g., TDNN [1], ResNet [2], and ECAPA-TDNN [3]) have become the mainstream approaches for speaker verification, deriving many variants [4, 5, 6, 7] with excellent performance. The success of these models depends on large-scale labelled datasets [8, 9]. Unfortunately, high-quali...

  2. [2]

    Robust Training for Speaker Verification against Noisy Labels

    Preliminaries Notation. For the speaker verification task, let c be the number of speakers and e be a one-hot vector with dimension of c. $D = \{(x_i, y_i)\}_{i=1}^{n}$ denotes the i.i.d. samples and corresponding ground-truth labels, where n is the number of utterances. $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$ is the dataset where the labels are corrupted and the proportion of no...

  3. [3]

    The details are as follows

    The Proposed Approach For the problem of noisy labels in the speaker dataset, we propose a two-stage learning approach and maintain a prediction set P for filtering noisy labels during the whole training process. The details are as follows. 3.1. Stage I: Early Learning Since the number of labels in the speaker dataset is huge, it is difficult for the ...

  4. [4]

    Experimental Details Data

    Experiments 4.1. Experimental Details Data. We demonstrated the superiority of our proposed method by conducting comprehensive experiments on VoxCeleb1 and 2 with different proportions of noisy labels. VoxCeleb1 [29] contains 1211 speakers and 148,642 utterances, and VoxCeleb2

  5. [5]

    self-confident

    contains 5994 speakers and 1,092,009 utterances. Their labels were manually checked and can be considered clean datasets (η = 0). To verify the noisy label robustness of the method and consider real-world scenarios, we set the noisy label ratio η to 0%, 5%, 10%, 20%, 30%, and 50%. Specifically, for a given noise rate η, we randomly select the correspondi...

  6. [6]

    Conclusions In this paper, we propose a novel and easy-to-implement framework for filtering noisy labels in speaker datasets. Specifically, a model with basic speaker discrimination ability is first obtained by early learning, and then self-confident learning is conducted based on this model, where the network is trained using our proposed OR-Gate...

  7. [7]

    X-vectors: Robust dnn embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333

  8. [8]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

  9. [9]

    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2650

  10. [10]

    Statistical pyramid dense time delay neural network for speaker verification,

    Z.-K. Wan, Q.-H. Ren, Y.-C. Qin, and Q.-R. Mao, “Statistical pyramid dense time delay neural network for speaker verification,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7532–7536

  11. [11]

    Mlp-svnet: A multi-layer perceptrons based network for speaker verification,

    B. Han, Z. Chen, B. Liu, and Y. Qian, “Mlp-svnet: A multi-layer perceptrons based network for speaker verification,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7522–7526

  12. [12]

    MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,

    Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H. yi Lee, and H. Meng, “MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,” in Proc. Interspeech 2022, 2022, pp. 306–310

  13. [13]

    Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction,

    B. Han, Z. Chen, and Y. Qian, “Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction,” in Proc. Interspeech 2022, 2022, pp. 4780–4784

  14. [14]

    Voxceleb: Large-scale speaker verification in the wild,

    A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “Voxceleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, p. 101027, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0885230819302712

  15. [15]

    Cn-celeb: Multi-genre speaker recognition,

    L. Li, R. Liu, J. Kang, Y. Fan, H. Cui, Y. Cai, R. Vipperla, T. F. Zheng, and D. Wang, “Cn-celeb: Multi-genre speaker recognition,” Speech Communication, vol. 137, pp. 77–91, 2022

  16. [16]

    Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,

    Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 87–102

  17. [17]

    Learning sound event classifiers from web audio with noisy labels,

    E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, and X. Serra, “Learning sound event classifiers from web audio with noisy labels,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 21–25

  18. [18]

    Learning from noisy labels with deep neural networks: A survey,

    H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2022

  19. [19]

    Deep learning from noisy image labels with quality embedding,

    J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang, “Deep learning from noisy image labels with quality embedding,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1909–1922, 2019

  20. [20]

    Robust inference via generative classifiers for handling noisy labels,

    K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference via generative classifiers for handling noisy labels,” in ICML, 09–15 Jun 2019, pp. 3763–3772. [Online]. Available: https://proceedings.mlr.press/v97/lee19f.html

  21. [21]

    mixup: Beyond empirical risk minimization,

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018

  22. [22]

    Open-set label noise can improve robustness against inherent label noise,

    H. Wei, L. Tao, R. XIE, and B. An, “Open-set label noise can improve robustness against inherent label noise,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 7978–7992. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/428fca9bc1921c25c5121f9da7815cde-Paper.pdf

  23. [23]

    Symmetric cross entropy for robust learning with noisy labels,

    Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 322–330

  24. [24]

    Can cross entropy loss be robust to label noise?

    L. Feng, S. Shu, Z. Lin, F. Lv, L. Li, and B. An, “Can cross entropy loss be robust to label noise?” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 7 2020, pp. 2206–2212

  25. [25]

    Dividemix: Learning with noisy labels as semi-supervised learning,

    J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” in International Conference on Learning Representations, 2020

  26. [26]

    Self: Learning to filter noisy labels with self-ensembling,

    D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “Self: Learning to filter noisy labels with self-ensembling,” in International Conference on Learning Representations, 2020

  27. [27]

    When speaker recognition meets noisy labels: Optimizations for front-ends and back-ends,

    L. Li, F. Tong, and Q. Hong, “When speaker recognition meets noisy labels: Optimizations for front-ends and back-ends,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1586–1599, 2022

  28. [28]

    Automatic Error Correction for Speaker Embedding Learning with Noisy Labels,

    F. Tong, Y. Liu, S. Li, J. Wang, L. Li, and Q. Hong, “Automatic Error Correction for Speaker Embedding Learning with Noisy Labels,” in Proc. Interspeech 2021, 2021, pp. 4628–4632

  29. [29]

    Bayesian estimation of plda with noisy training labels, with applications to speaker verification,

    B. J. Borgström and P. Torres-Carrasquillo, “Bayesian estimation of plda with noisy training labels, with applications to speaker verification,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7594–7598

  30. [30]

    Adaptive early-learning correction for segmentation from noisy annotations,

    S. Liu, K. Liu, W. Zhu, Y. Shen, and C. Fernandez-Granda, “Adaptive early-learning correction for segmentation from noisy annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 2606–2616

  31. [31]

    Understanding and improving early stopping for learning with noisy labels,

    Y. Bai, E. Yang, B. Han, Y. Yang, J. Li, Y. Mao, G. Niu, and T. Liu, “Understanding and improving early stopping for learning with noisy labels,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 24392–24403

  32. [32]

    Investigating why contrastive learning benefits robustness against label noise,

    Y. Xue, K. Whitecross, and B. Mirzasoleiman, “Investigating why contrastive learning benefits robustness against label noise,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 20...

  33. [33]

    Curriculum learning,

    Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: Association for Computing Machinery, 2009, p. 41–48. [Online]. Available: https://doi.org/10.1145/1553374.1553380

  34. [34]

    Curriculum learning based approaches for noise robust speaker recognition,

    S. Ranjan and J. H. L. Hansen, “Curriculum learning based approaches for noise robust speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 197–210, 2018

  35. [35]

    VoxCeleb: A Large-Scale Speaker Identification Dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proc. Interspeech 2017, 2017, pp. 2616–2620

  36. [36]

    VoxCeleb2: Deep Speaker Recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090

  37. [37]

    Attentive Statistics Pooling for Deep Speaker Embedding,

    K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive Statistics Pooling for Deep Speaker Embedding,” in Proc. Interspeech 2018, 2018, pp. 2252–2256

  38. [38]

    Additive margin softmax for face verification,

    F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018

  39. [39]

    Self-paced learning for latent variable models,

    M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’10. Red Hook, NY, USA: Curran Associates Inc., 2010, p. 1189–1197