Robust Training for Speaker Verification against Noisy Labels
Pith reviewed 2026-05-24 10:37 UTC · model grok-4.3
The pith
A two-stage training process with an OR-Gate top-k mechanism filters noisy labels during speaker verification model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that a two-stage learning method filters noisy labels and yields strong verification performance on VoxCeleb1 and VoxCeleb2 at multiple added noise rates. The method first trains on all data for several epochs, then iteratively compares model predictions with the given labels through an OR-Gate with top-k mechanism and retrains on the samples it retains as clean.
What carries the argument
The OR-Gate with top-k mechanism, which retains training examples only when model output matches the supplied label and selects the top-k matches per batch or class.
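The page gives no formal definition of the selection rule, so the Python sketch here shows only one possible reading of it. It assumes that a sample counts as a match when its given label is among the k highest-scoring classes, and that the OR-gate combines the current match with a match stored from an earlier pass (the paper excerpt mentions a maintained prediction set P). The function names, the NumPy implementation, and the exact gate logic are assumptions, not the authors' code.

```python
import numpy as np

def topk_hit(logits, labels, k):
    """Return a boolean mask: True where the given label is among the
    k highest-scoring classes of that sample."""
    # argpartition places the k largest entries in the last k columns (unordered).
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]
    return (topk == labels[:, None]).any(axis=1)

def or_gate_select(curr_logits, prev_hit, labels, k):
    """ASSUMED gate: keep a sample if its label is a top-k hit in the current
    pass OR was a hit in the stored result of an earlier pass."""
    curr_hit = topk_hit(curr_logits, labels, k)
    keep = curr_hit | prev_hit
    # curr_hit is returned so the caller can store it for the next pass.
    return keep, curr_hit
```

Under this reading, a larger k or a permissive stored mask keeps more samples, so both act as the tuning knobs that trade recall of clean samples against leakage of noisy ones.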
If this is right
- Verification accuracy on VoxCeleb1 and VoxCeleb2 stays high across different added noise rates.
- The iterative selection finishes training without any external label correction tool.
- Standard training on the same noisy data yields lower performance than the filtered version.
- The approach works with existing DNN architectures for speaker embedding without architectural changes.
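The noise rates in the first bullet are synthetic. A minimal sketch of how such label corruption could be injected, assuming symmetric noise in which a fraction η of utterances receives a uniformly drawn wrong speaker label. The page does not spell out the authors' exact procedure, and `inject_symmetric_noise` is a hypothetical helper.

```python
import numpy as np

def inject_symmetric_noise(labels, num_speakers, eta, seed=0):
    """Replace the labels of a random fraction `eta` of samples with a
    different speaker id. Returns the noisy labels and a boolean mask
    marking which samples were corrupted."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(eta * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_speakers - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_speakers, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_speakers
    is_noisy = np.zeros(len(labels), dtype=bool)
    is_noisy[flip_idx] = True
    return noisy, is_noisy
```

Keeping the `is_noisy` mask around is what makes later diagnostics possible, since it records which labels were corrupted.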
Where Pith is reading between the lines
- The same early-training agreement signal could be tested on other audio classification tasks that suffer label noise.
- Replacing the top-k rule with a learned threshold might reduce the need to tune the selection parameter.
- Real-world label errors, rather than artificially injected noise, would provide a stronger test of the method's practicality.
Load-bearing premise
Deep neural networks fit clean-labeled examples before noisy ones during the first few training epochs.
What would settle it
An experiment showing that prediction-label agreement rates do not separate clean from noisy examples after the initial epochs would undermine the selection step.
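Such a check could be run after each early epoch. The sketch below assumes the noise was injected synthetically, so a mask of corrupted samples is available; `agreement_by_partition` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def agreement_by_partition(pred, given, is_noisy):
    """Fraction of samples whose predicted class equals the given label,
    reported separately for the clean and the corrupted partition."""
    pred = np.asarray(pred)
    given = np.asarray(given)
    is_noisy = np.asarray(is_noisy, dtype=bool)
    agree = pred == given
    # Each partition must be non-empty, otherwise the mean is undefined (nan).
    return agree[~is_noisy].mean(), agree[is_noisy].mean()
```

If the two rates stay close to each other across the initial epochs, prediction-label agreement carries little information about label quality, which is the failure case described above.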
Original abstract
The deep learning models used for speaker verification rely heavily on large amounts of data and correct labeling. However, noisy (incorrect) labels often occur, which degrades the performance of the system. In this paper, we propose a novel two-stage learning method to filter out noisy labels from speaker datasets. Since a DNN will first fit data with clean labels, we first train the model with all data for several epochs. Then, based on this model, the model predictions are compared with the labels using our proposed OR-Gate with top-k mechanism to select the data with clean labels, and the selected data is used to train the model. This process is iterated until the training is completed. We have demonstrated the effectiveness of this method in filtering noisy labels through extensive experiments and have achieved excellent performance on VoxCeleb (1 and 2) with different added noise rates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a novel two-stage iterative training method for speaker verification to mitigate noisy labels. The approach first trains a DNN on the entire (potentially noisy) dataset for several epochs, then applies an OR-Gate top-k selection mechanism that compares model predictions against given labels to retain presumed clean samples, retrains on the filtered set, and repeats the process until training completes. The authors report that this filters noisy labels effectively and achieves excellent performance on VoxCeleb1 and VoxCeleb2 under varying rates of added synthetic noise.
Significance. If the central claim holds, the method offers a relatively simple, assumption-driven procedure for robust training on imperfect speaker datasets, which is practically relevant given the scale and labeling challenges of corpora like VoxCeleb. The controlled noise-injection experiments on standard benchmarks provide a concrete testbed, though the significance hinges on whether the early-learning separation generalizes to the speaker-embedding setting.
major comments (2)
- [§3 (Proposed Method)] The procedure is motivated by the statement that 'a DNN will first fit data with clean labels,' which is used to justify the initial training epochs before OR-Gate top-k selection. No domain-specific validation is supplied (e.g., per-epoch loss or accuracy curves on explicitly partitioned clean vs. noisy subsets of VoxCeleb) for the chosen architecture, loss, or embedding extractor. This assumption is load-bearing: if the separation does not occur at the selected epoch, the subsequent filtering retains noisy samples and the iterative claim fails.
- [§4 (Experiments)] The manuscript claims 'excellent performance' across noise rates but does not report direct head-to-head comparisons against established noisy-label baselines (e.g., Co-teaching or standard label smoothing) using identical noise schedules, model backbone, and evaluation protocol on VoxCeleb. Without these controls it is impossible to isolate the contribution of the OR-Gate top-k mechanism from other factors.
minor comments (2)
- The abstract states the performance claim without any numerical results (EER, minDCF, etc.); including at least the key metrics for the highest noise rate would strengthen the summary.
- [§3] A formal definition or pseudocode box for the OR-Gate top-k operation would clarify the exact selection rule and its hyperparameters (k, number of initial epochs).
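For reference, the equal error rate (EER) named in the first minor comment is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch, assuming a plain threshold sweep over raw trial scores; `equal_error_rate` is a hypothetical helper, and evaluation toolkits typically interpolate between thresholds rather than taking the nearest crossing.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.
    A trial is accepted when its score is >= the threshold."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([tar, non]))
    # False acceptance: non-target trials that pass the threshold.
    far = np.array([(non >= t).mean() for t in thresholds])
    # False rejection: target trials that fall below the threshold.
    frr = np.array([(tar < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```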
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.
-
Referee: [§3 (Proposed Method)] The procedure is motivated by the statement that 'a DNN will first fit data with clean labels,' which is used to justify the initial training epochs before OR-Gate top-k selection. No domain-specific validation is supplied (e.g., per-epoch loss or accuracy curves on explicitly partitioned clean vs. noisy subsets of VoxCeleb) for the chosen architecture, loss, or embedding extractor. This assumption is load-bearing: if the separation does not occur at the selected epoch, the subsequent filtering retains noisy samples and the iterative claim fails.
Authors: We agree that the early-learning assumption would be strengthened by speaker-verification-specific evidence. While the assumption draws from established noisy-label literature and our empirical gains under controlled noise are consistent with it holding, we will add per-epoch loss and accuracy curves on explicitly partitioned clean/noisy subsets of VoxCeleb in the revised manuscript to validate the separation timing for our architecture and loss. revision: yes
-
Referee: [§4 (Experiments)] The manuscript claims 'excellent performance' across noise rates but does not report direct head-to-head comparisons against established noisy-label baselines (e.g., Co-teaching or standard label smoothing) using identical noise schedules, model backbone, and evaluation protocol on VoxCeleb. Without these controls it is impossible to isolate the contribution of the OR-Gate top-k mechanism from other factors.
Authors: We concur that head-to-head comparisons are necessary to isolate the OR-Gate top-k contribution. In the revision we will include experiments against Co-teaching and label smoothing under identical noise-injection schedules, model backbones, and VoxCeleb evaluation protocols, allowing direct assessment of our method relative to these baselines. revision: yes
Circularity Check
No circularity: empirical iterative filter with external assumption
The paper describes a two-stage iterative procedure: train on all data for initial epochs (invoking the general early-learning property of DNNs), then apply OR-Gate top-k selection on predictions vs. labels to retain presumed-clean samples, and repeat. This selection heuristic and iteration are procedural choices, not a mathematical derivation whose output equals its inputs by construction. The early-learning premise is cited from prior literature on noisy-label learning rather than self-citation or internal definition. No equations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided abstract or described method. Performance claims rest on experiments with added noise on VoxCeleb, which are externally falsifiable and not forced by the procedure itself. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A DNN will first fit data with clean labels
Reference graph
Works this paper leans on
-
[1]
The success of these models depends on large-scale labelled datasets [8, 9]
Introduction. In recent years, speaker models based on deep neural networks (e.g., TDNN [1], ResNet [2], and ECAPA-TDNN [3]) have become the mainstream approaches for speaker verification, deriving many variants [4, 5, 6, 7] with excellent performance. The success of these models depends on large-scale labelled datasets [8, 9]. Unfortunately, high-quali...
-
[2]
Robust Training for Speaker Verification against Noisy Labels
Preliminaries. Notation. For the speaker verification task, let c be the number of speakers and e be a one-hot vector with dimension of c. $D = \{(x_i, y_i)\}_{i=1}^{n}$ denotes the i.i.d. samples and corresponding ground-truth labels, where n is the number of utterances. $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$ is the dataset where the labels are corrupted and the proportion of no...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
The Proposed Approach. For the problem of noisy labels in the speaker dataset, we propose a two-stage learning approach and maintain a prediction set P for filtering noisy labels during the whole training process. The details are as follows. 3.1. Stage I: Early Learning. Since the number of labels in the speaker dataset is huge, it is difficult for the ...
-
[4]
Experiments. 4.1. Experimental Details. Data. We demonstrated the superiority of our proposed method by conducting comprehensive experiments on VoxCeleb1 and 2 with different proportions of noisy labels. VoxCeleb1 [29] contains 1211 speakers and 148,642 utterances, and VoxCeleb2
-
[5]
contains 5994 speakers and 1,092,009 utterances. Their labels were manually checked and can be considered clean datasets (η = 0). To verify the noisy label robustness of the method and consider real-world scenarios, we set the noisy label ratio η to 0%, 5%, 10%, 20%, 30%, and 50%. Specifically, for a given noise rate η, we randomly select the correspondi...
-
[6]
Conclusions. In this paper, we propose a novel and easy-to-implement framework for filtering noisy labels in speaker datasets. Specifically, a model with basic speaker discrimination ability is first obtained by early learning, and then self-confident learning is conducted based on this model, where the network is trained using our proposed OR-Gate...
-
[7]
X-vectors: Robust dnn embeddings for speaker recognition,
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333
-
[8]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778
-
[9]
B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2650
-
[10]
Statistical pyramid dense time delay neural network for speaker verification,
Z.-K. Wan, Q.-H. Ren, Y.-C. Qin, and Q.-R. Mao, “Statistical pyramid dense time delay neural network for speaker verification,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7532–7536
-
[11]
Mlp-svnet: A multi-layer perceptrons based network for speaker verification,
B. Han, Z. Chen, B. Liu, and Y. Qian, “Mlp-svnet: A multi-layer perceptrons based network for speaker verification,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7522–7526
-
[12]
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,
Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H.-y. Lee, and H. Meng, “MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,” in Proc. Interspeech 2022, 2022, pp. 306–310
-
[13]
Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction,
B. Han, Z. Chen, and Y. Qian, “Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction,” in Proc. Interspeech 2022, 2022, pp. 4780–4784
-
[14]
Voxceleb: Large-scale speaker verification in the wild,
A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “Voxceleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, p. 101027, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0885230819302712
-
[15]
Cn-celeb: Multi-genre speaker recognition,
L. Li, R. Liu, J. Kang, Y. Fan, H. Cui, Y. Cai, R. Vipperla, T. F. Zheng, and D. Wang, “Cn-celeb: Multi-genre speaker recognition,” Speech Communication, vol. 137, pp. 77–91, 2022
-
[16]
Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,
Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 87–102
-
[17]
Learning sound event classifiers from web audio with noisy labels,
E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, and X. Serra, “Learning sound event classifiers from web audio with noisy labels,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 21–25
-
[18]
Learning from noisy labels with deep neural networks: A survey,
H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2022
-
[19]
Deep learning from noisy image labels with quality embedding,
J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang, “Deep learning from noisy image labels with quality embedding,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1909–1922, 2019
-
[20]
Robust inference via generative classifiers for handling noisy labels,
K. Lee, S. Yun, K. Lee, H. Lee, B. Li, and J. Shin, “Robust inference via generative classifiers for handling noisy labels,” in ICML, 09–15 Jun 2019, pp. 3763–3772. [Online]. Available: https://proceedings.mlr.press/v97/lee19f.html
-
[21]
mixup: Beyond empirical risk minimization,
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018
-
[22]
Open-set label noise can improve robustness against inherent label noise,
H. Wei, L. Tao, R. Xie, and B. An, “Open-set label noise can improve robustness against inherent label noise,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 7978–7992. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/428fca9bc1921c25c5121f9da7815cde-Paper.pdf
-
[23]
Symmetric cross entropy for robust learning with noisy labels,
Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 322–330
-
[24]
Can cross entropy loss be robust to label noise?
L. Feng, S. Shu, Z. Lin, F. Lv, L. Li, and B. An, “Can cross entropy loss be robust to label noise?” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 7 2020, pp. 2206–2212
-
[25]
Dividemix: Learning with noisy labels as semi-supervised learning,
J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” in International Conference on Learning Representations, 2020
-
[26]
Self: Learning to filter noisy labels with self-ensembling,
D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “Self: Learning to filter noisy labels with self-ensembling,” in International Conference on Learning Representations, 2020
-
[27]
When speaker recognition meets noisy labels: Optimizations for front-ends and back-ends,
L. Li, F. Tong, and Q. Hong, “When speaker recognition meets noisy labels: Optimizations for front-ends and back-ends,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1586–1599, 2022
-
[28]
Automatic Error Correction for Speaker Embedding Learning with Noisy Labels,
F. Tong, Y. Liu, S. Li, J. Wang, L. Li, and Q. Hong, “Automatic Error Correction for Speaker Embedding Learning with Noisy Labels,” in Proc. Interspeech 2021, 2021, pp. 4628–4632
-
[29]
Bayesian estimation of plda with noisy training labels, with applications to speaker verification,
B. J. Borgström and P. Torres-Carrasquillo, “Bayesian estimation of plda with noisy training labels, with applications to speaker verification,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7594–7598
-
[30]
Adaptive early-learning correction for segmentation from noisy annotations,
S. Liu, K. Liu, W. Zhu, Y. Shen, and C. Fernandez-Granda, “Adaptive early-learning correction for segmentation from noisy annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 2606–2616
-
[31]
Understanding and improving early stopping for learning with noisy labels,
Y. Bai, E. Yang, B. Han, Y. Yang, J. Li, Y. Mao, G. Niu, and T. Liu, “Understanding and improving early stopping for learning with noisy labels,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 24392–24403
-
[32]
Investigating why contrastive learning benefits robustness against label noise,
Y. Xue, K. Whitecross, and B. Mirzasoleiman, “Investigating why contrastive learning benefits robustness against label noise,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 20...
-
[33]
Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: Association for Computing Machinery, 2009, p. 41–48. [Online]. Available: https://doi.org/10.1145/1553374.1553380
-
[34]
Curriculum learning based approaches for noise robust speaker recognition,
S. Ranjan and J. H. L. Hansen, “Curriculum learning based approaches for noise robust speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 197–210, 2018
-
[35]
VoxCeleb: A Large-Scale Speaker Identification Dataset,
A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proc. Interspeech 2017, 2017, pp. 2616–2620
-
[36]
VoxCeleb2: Deep Speaker Recognition,
J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090
-
[37]
Attentive Statistics Pooling for Deep Speaker Embedding,
K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive Statistics Pooling for Deep Speaker Embedding,” in Proc. Interspeech 2018, 2018, pp. 2252–2256
-
[38]
Additive margin softmax for face verification,
F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018
-
[39]
Self-paced learning for latent variable models,
M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’10. Red Hook, NY, USA: Curran Associates Inc., 2010, p. 1189–1197