The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge
Pith reviewed 2026-05-25 09:06 UTC · model grok-4.3
The pith
The DKU system reaches 0.3532 minDCF and 4.96% EER on the 2019 VOiCES far-field speaker verification evaluation set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set. The best single system is a residual neural network trained with angular softmax loss; adding weighted prediction error (WPE) dereverberation improves it further, and it reaches 0.3668 minDCF and 5.58% EER on the evaluation set using only simple cosine similarity scoring.
What carries the argument
A residual neural network trained with angular softmax loss for utterance-level speaker modeling, combined with weighted prediction error (WPE) dereverberation and cosine scoring with score normalization.
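To make the angular softmax idea concrete, here is a minimal sketch of the angular margin function used by A-Softmax (SphereFace), which replaces cos(theta) for the target class with a steeper, monotonically decreasing function. The margin m=4 and the function names are assumptions for illustration; this page does not state the value used in the DKU system.

```python
import math


def a_softmax_psi(theta, m=4):
    """A-Softmax margin: psi(theta) = (-1)^k * cos(m*theta) - 2k
    for theta in [k*pi/m, (k+1)*pi/m], k = 0..m-1.

    m=4 is an assumed, illustrative margin, not a value taken from the paper.
    """
    if not 0.0 <= theta <= math.pi:
        raise ValueError("theta must lie in [0, pi]")
    # Index of the angular segment that contains theta (clamped at theta == pi).
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def target_logit(embedding_norm, theta, m=4):
    """Logit of the true class: ||x|| * psi(theta), with unit-norm class weights."""
    return embedding_norm * a_softmax_psi(theta, m)
```

With m=1 the function reduces to cos(theta), i.e. ordinary softmax on normalized weights; larger m demands a smaller angle to the true class for the same logit.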
If this is right
- The full pipeline of pre-processing, residual network embeddings, weighted prediction error, and normalization reduces both minDCF and EER relative to simpler baselines.
- Angular softmax training produces embeddings that support effective cosine scoring in far-field conditions.
- Adding weighted prediction error yields measurable gains beyond the neural network alone.
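The cosine scoring referred to above can be sketched in a few lines. This is an illustrative stand-in, not the authors' code: the function names and the threshold argument are hypothetical, and embeddings are plain Python lists rather than network outputs.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_e = math.sqrt(sum(a * a for a in enroll))
    norm_t = math.sqrt(sum(b * b for b in test))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_e * norm_t)


def accept(enroll, test, threshold):
    """Accept the trial as same-speaker when the score reaches the threshold."""
    return cosine_score(enroll, test) >= threshold
```

Because the score depends only on the angle between embeddings, it pairs naturally with a loss that separates speakers by angle.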
Where Pith is reading between the lines
- The same modular pipeline could be transferred to other distant-audio tasks such as meeting transcription or voice commands in large rooms.
- Replacing the residual network with newer architectures might produce further reductions in EER if retrained under the same loss.
- Score normalization appears to correct for score distribution shifts caused by varying distances and room acoustics.
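The rebuttal further down names AS-Norm as the normalization step. A minimal sketch of adaptive symmetric score normalization follows, assuming the common top-N cohort formulation; the cohort size, helper names, and the use of population standard deviation are assumptions for illustration and are not taken from the paper.

```python
import statistics


def top_n_stats(cohort_scores, top_n):
    """Mean and standard deviation of the top_n highest cohort scores."""
    top = sorted(cohort_scores, reverse=True)[:top_n]
    mean = statistics.mean(top)
    std = statistics.pstdev(top)
    return mean, max(std, 1e-8)  # guard against a zero spread


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=200):
    """Average of the z-normalized score on the enrollment and the test side.

    enroll_cohort_scores: scores of the enrollment embedding against a cohort.
    test_cohort_scores:   scores of the test embedding against the same cohort.
    top_n=200 is an assumed default, not a value reported on this page.
    """
    mu_e, sd_e = top_n_stats(enroll_cohort_scores, top_n)
    mu_t, sd_t = top_n_stats(test_cohort_scores, top_n)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```

Restricting the statistics to the closest cohort entries is what makes the normalization adaptive: each trial is calibrated against impostors that resemble it.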
Load-bearing premise
The challenge evaluation set provides a representative test of far-field speaker verification performance without hidden domain shifts or selection effects that would make the reported numbers unrepresentative.
What would settle it
An independent team re-running the exact pipeline on the same evaluation set and obtaining error rates materially different from 0.3532 minDCF and 4.96% EER would show the reported figures do not hold.
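For anyone checking the reported figures, the two metrics can be computed from raw trial scores roughly as follows. This is a sketch under stated assumptions: the default cost parameters (p_target=0.01, unit costs) are illustrative and should be checked against the challenge evaluation plan, and the EER is taken at the nearest threshold without interpolation.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, P_miss, P_fa) for every candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # operating point that rejects everything
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """EER approximated at the threshold where P_miss and P_fa are closest."""
    points = error_rates(target_scores, nontarget_scores)
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds.

    The default parameters are assumptions, not values quoted on this page.
    """
    points = error_rates(target_scores, nontarget_scores)
    best = min(c_miss * p_target * pm + c_fa * (1 - p_target) * pf
               for _, pm, pf in points)
    return best / min(c_miss * p_target, c_fa * (1 - p_target))
```

A materially different result from the same trial list and scores would point to a discrepancy in the pipeline rather than in the metric.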
Original abstract
In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for the far-field speaker verification, including data pre-processing, short-term spectral feature representation, utterance-level speaker modeling, back-end scoring, and score normalization. Our best single system employs a residual neural network trained with angular softmax loss. Also, the weighted prediction error algorithms can further improve performance. It achieves 0.3668 minDCF and 5.58% EER on the evaluation set by using a simple cosine similarity scoring. Finally, the submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the DKU system submitted to the speaker recognition task of the 2019 VOiCES from a Distance Challenge. It describes the full pipeline for far-field speaker verification, covering data pre-processing, short-term spectral features, utterance-level modeling via a residual neural network trained with angular softmax loss, back-end scoring with cosine similarity, and score normalization. The best single system with WPE preprocessing achieves 0.3668 minDCF and 5.58% EER; the submitted primary system reaches 0.3532 minDCF and 4.96% EER on the evaluation set.
Significance. If the reported numbers are reproducible and the evaluation set is unbiased, the work supplies a concrete empirical reference point on the VOiCES benchmark, demonstrating that ResNet + angular softmax combined with WPE can deliver competitive far-field performance. Challenge system papers of this type are useful for establishing baselines, but the lack of supporting analysis reduces the strength of any broader claims about generalization.
major comments (1)
- [Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.
Simulated Author's Rebuttal
We thank the referee for the detailed review of our manuscript on the DKU system for the 2019 VOiCES challenge. Below we respond point-by-point to the major comment.
- Referee: [Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.
Authors: The abstract is intentionally concise, as is conventional for challenge system descriptions. The full pipeline (WPE preprocessing, 64-dim log-Mel features, ResNet34 with angular softmax, cosine scoring, and AS-Norm) is detailed in Sections 2-4 of the manuscript, along with the training data (VoxCeleb + VOiCES dev) and the fact that all results are on the official VOiCES 2019 evaluation set. We agree that a short clause noting the official challenge evaluation protocol would aid interpretation and will add one sentence to the abstract. Error bars are not reported because the challenge provides a single fixed evaluation set with no provision for multiple independent runs. Ablation studies are outside the scope of this system paper, which focuses on the submitted primary system rather than comparative analysis.
Revision: partial
- Error bars on the reported minDCF and EER, because the challenge evaluation consists of a single run on a fixed test set.
- Evaluation-set selection criteria and complete acoustic-condition statistics, which are determined by the VOiCES organizers and not fully disclosed to participants.
Circularity Check
No derivation chain; purely empirical system report
The paper describes a speaker verification pipeline (data preprocessing, ResNet with angular softmax, WPE, cosine scoring) and reports challenge metrics (0.3532 minDCF, 4.96% EER) on the public VOiCES evaluation set. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present that could reduce any claim to its own inputs by construction. The central claims are experimental results on an external benchmark, which are self-contained and externally falsifiable.
Reference graph
Works this paper leans on
-
[1]
VOiCES from a Distance Challenge 2019
Introduction In the past decade, the performance of speaker recognition has improved significantly. The i-vector based method [1] and the deep neural network (DNN) based methods [2, 3] have promoted the development of speaker recognition technology in telephone channel and closed talking scenarios. However, speaker recognition under far-field and complex en...
work page 2019
-
[2]
The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge
System descriptions 2.1. Data pre-processing 2.1.1. Data augmentation We adopt two kinds of data augmentation strategies. The first is the same as the x-vector system available at Kaldi VoxCeleb recipe, which employs additive noises and reverberation. We also use pyroomacoustics [24] to simulate the room acoustic based on RIR generator using Image Source ...
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[3]
Data usage The training data includes VoxCeleb 1 [37] and VoxCeleb 2 [38]
Experiments 3.1. Data usage The training data includes VoxCeleb 1 [37] and VoxCeleb 2 [38]. The original distribution of VoxCeleb split each video into multiple short segments. During training, the segments from the same video are concatenated into a single sound wave, which results in 167897 utterances from 7245 speakers. No voice activity detection (...
-
[4]
Conclusions We presented the components and analyzed the results of the DKU-SMIIP speaker recognition system for the VOiCES from a Distance Challenge 2019. We use different acoustic features, different front-end modeling methods, and various back-end scoring methods. To further improve the performance, we use WPE to dereverberate the development and ev...
work page 2019
-
[5]
Front-End Factor Analysis for Speaker Verification,
N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011
work page 2011
-
[6]
x-vectors: Robust DNN Embeddings for Speaker Recognition,
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “x-vectors: Robust DNN Embeddings for Speaker Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333
work page 2018
-
[7]
W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 74–81
work page 2018
-
[8]
M. Wolfel and J. McDonough, Distant Speech Recognition. John Wiley & Sons, Incorporated, 2009
work page 2009
-
[9]
The Perception of Speech Under Adverse Conditions,
P. Assmann and Q. Summerfield, “The Perception of Speech Under Adverse Conditions,” in Speech Processing in the Auditory System. Springer New York, 2004, pp. 231–308
work page 2004
-
[10]
Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and Biing-Hwang Juang, “Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010
work page 2010
-
[11]
Robust Speaker Identification in Noisy and Reverberant Conditions,
X. Zhao, Y. Wang, and D. Wang, “Robust Speaker Identification in Noisy and Reverberant Conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014
work page 2014
-
[12]
M. Kolboek, Z.-H. Tan, and J. Jensen, “Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification,” in IEEE Spoken Language Technology Workshop, 2016, pp. 305–311
work page 2016
-
[13]
DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,
Z. Oo, Y. Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, “DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 2204–2208
work page 2016
-
[14]
Front-end speech enhancement for commercial speaker verification systems,
S. E. Eskimez, P. Soufleris, Z. Duan, and W. Heinzelman, “Front-end speech enhancement for commercial speaker verification systems,” Speech Communication, vol. 99, pp. 101–113, 2018
work page 2018
-
[15]
Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,
J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 196–200
work page 2016
-
[16]
Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,
E. Warsitz and R. Haeb-Umbach, “Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007
work page 2007
-
[17]
Modulation Spectral Features for Robust Far-Field Speaker Identification,
T. Falk and Wai-Yip Chan, “Modulation Spectral Features for Robust Far-Field Speaker Identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2010
work page 2010
-
[18]
S. O. Sadjadi and J. H. Hansen, “Hilbert Envelope Based Features for Robust Speaker Identification Under Reverberant Mismatched Conditions,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 5448–5451
work page 2011
-
[19]
Speaker Identification with Distant Microphone Speech,
Q. Jin, R. Li, Q. Yang, K. Laskowski, and T. Schultz, “Speaker Identification with Distant Microphone Speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4518–4521
work page 2010
-
[20]
Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch,
S. O. Sadjadi and J. H. L. Hansen, “Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 5, pp. 937–945, 2014
work page 2014
-
[21]
Reverberation Matching for Speaker Recognition,
I. Peer, B. Rafaely, and Y. Zigel, “Reverberation Matching for Speaker Recognition,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4829–4832
work page 2008
-
[22]
A. R. Avila, M. Sarria-Paja, F. J. Fraga, D. O’Shaughnessy, and T. H. Falk, “Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-Vector Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2014, pp. 1096–1100
work page 2014
-
[23]
D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, “Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4257–4260
work page 2012
-
[24]
M. K. Nandwana, J. van Hout, M. McLaren, A. Stauffer, C. Richey, A. Lawson, and M. Graciarena, “Robust Speaker Recognition from Distant Speech under Real Reverberant Environments Using Speaker Embeddings,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1106–1110
work page 2018
-
[25]
Far-Field Speaker Recognition,
Q. Jin, T. Schultz, and A. Waibel, “Far-Field Speaker Recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2023–2032, 2007
work page 2007
-
[26]
Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,
Mikyong Ji, Sungtak Kim, Hoirin Kim, and Ho-Sub Yoon, “Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,” IEEE Transactions on Consumer Electronics, vol. 54, no. 1, pp. 140–144, 2008
work page 2008
-
[27]
The VOiCES from a Distance Challenge 2019 Evaluation Plan
M. K. Nandwana, J. V. Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, “The VOiCES from a Distance Challenge 2019 Evaluation Plan,” arXiv:1902.10828 [eess.AS], 2019
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[28]
Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,
R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 351–355
work page 2018
-
[29]
Voices Obscured in Complex Environmental Settings (VOICES) corpus,
C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices Obscured in Complex Environmental Settings (VOICES) corpus,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, p...
work page 2018
-
[30]
MUSAN: A Music, Speech, and Noise Corpus
D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[31]
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,
C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016
work page 2016
-
[32]
Complex Sounds and Auditory Images,
R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex Sounds and Auditory Images,” in Auditory Physiology and Perception. Oxford, UK: Y. Cazals, L. Demany, and K. Horner, (Eds), Pergamon Press, 1992, pp. 429–446
work page 1992
-
[33]
Insights into End-to-End Learning Scheme for Language Identification,
W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213
work page 2018
-
[34]
Analysis of length normalization in end-to-end speaker verification system,
W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622
work page 2018
-
[35]
Sphereface: Deep Hypersphere Embedding for Face Recognition,
W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220
work page 2017
-
[36]
Return of Frustratingly Easy Domain Adaptation,
B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065
work page 2016
-
[37]
Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,
M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey: The Speaker and Language Recognition Workshop, 2018
work page 2018
-
[38]
Analysis of i-vector Length Normalization in Speaker Recognition Systems,
D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252
work page 2011
-
[39]
Analysis of Score Normalization in Multilingual Speaker Recognition,
P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017
work page 2017
-
[40]
The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF
N. Brümmer and E. De Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[41]
VoxCeleb: A Large-Scale Speaker Identification Dataset,
A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2616–2620
work page 2017
-
[42]
VoxCeleb2: Deep Speaker Recognition,
J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018
work page 2018