The DKU-SMIIP System for NIST 2018 Speaker Recognition Evaluation
Pith reviewed 2026-05-25 09:09 UTC · model grok-4.3
The pith
A multi-extractor speaker verification system achieves actual detection costs of 0.392 on the CMN2 evaluation set and 0.494 on the VAST evaluation set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The submitted system, which employs multiple state-of-the-art front-end extractors including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet, along with back-end modeling for variability compensation and domain adaptation, obtains an actual detection cost of 0.392 on CMN2 evaluation data and 0.494 on VAST evaluation data under the fixed condition. Further post-evaluation experiments investigate different encoding layer designs and loss functions for the deep ResNet component.
What carries the argument
The pipeline of multiple front-end speaker embedding extractors followed by back-end variability compensation and domain adaptation.
If this is right
- Combining different embedding extractors improves performance in mismatched conditions.
- Domain adaptation in the back-end is key to handling training-test differences.
- Post-evaluation analysis of ResNet designs can lead to further refinements in embedding quality.
- The system extends the use of tandem and deep neural network based extractors for speaker tasks.
Where Pith is reading between the lines
- Such combinations may generalize to other audio classification tasks with domain shifts.
- Real-time applications could benefit from optimized versions of these multi-extractor systems.
- Comparing these costs to single-extractor baselines would clarify the contribution of each component.
Load-bearing premise
The described combination of front-end extractors and back-end steps directly produces the reported detection costs on the specified evaluation data without additional undisclosed modifications.
What would settle it
Independent reproduction of the system on the CMN2 and VAST data partitions yielding different detection costs would indicate that the reported figures do not hold under the stated conditions.
Original abstract
In this paper, we present the system submission for the NIST 2018 Speaker Recognition Evaluation by the DKU Speech and Multi-Modal Intelligent Information Processing (SMIIP) Lab. We explore various kinds of state-of-the-art front-end extractors as well as back-end modeling for text-independent speaker verification. Our submitted primary systems employ multiple state-of-the-art front-end extractors, including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet. After the speaker embedding is extracted, we exploit several kinds of back-end modeling to perform variability compensation and domain adaptation for mismatched training and testing conditions. The final submitted system on the fixed condition obtains actual detection costs of 0.392 and 0.494 on CMN2 and VAST evaluation data, respectively. After the official evaluation, we further extend our experiments by investigating multiple encoding layer designs and loss functions for the deep ResNet system.
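For readers unfamiliar with the metric, the following is a minimal sketch of how an actual detection cost can be computed from log-likelihood-ratio scores. It is an illustration, not the official NIST scoring tool. The function names, the toy scores, and the unit miss and false-alarm costs are assumptions; the target priors of 0.01 and 0.005 averaged in the example are assumed to match the CMN2 operating points and should be checked against the evaluation plan.

```python
import math


def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Log-likelihood-ratio threshold log(beta) implied by the cost parameters."""
    return math.log((c_fa / c_miss) * (1.0 - p_target) / p_target)


def normalized_dcf(p_miss, p_fa, p_target, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost: P_miss + beta * P_fa."""
    beta = (c_fa / c_miss) * (1.0 - p_target) / p_target
    return p_miss + beta * p_fa


def actual_dcf(llrs, labels, p_target):
    """Cost when decisions are made at the fixed Bayes threshold.

    llrs: log-likelihood-ratio scores; labels: 1 for target trials, 0 otherwise.
    """
    threshold = bayes_threshold(p_target)
    targets = [s for s, y in zip(llrs, labels) if y == 1]
    nontargets = [s for s, y in zip(llrs, labels) if y == 0]
    p_miss = sum(s < threshold for s in targets) / len(targets)
    p_fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    return normalized_dcf(p_miss, p_fa, p_target)


# Hypothetical toy trial list (not evaluation data).
toy_llrs = [6.0, 5.5, 3.0, 4.0, -2.0, -7.0, 0.5, -4.0]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Average the cost over two assumed target priors.
toy_cost = sum(actual_dcf(toy_llrs, toy_labels, p) for p in (0.01, 0.005)) / 2
print(round(toy_cost, 3))
```

Because the threshold is fixed by the cost parameters rather than tuned on the trials, the actual cost penalizes poorly calibrated scores as well as poor discrimination.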
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the DKU-SMIIP Lab submission to the NIST 2018 Speaker Recognition Evaluation. It details multiple front-end speaker embedding extractors (MFCC i-vector, DNN tandem i-vector, TDNN x-vector, ResNet) and back-end techniques for variability compensation and domain adaptation under mismatched conditions. The primary fixed-condition system is reported to achieve actual detection costs of 0.392 on CMN2 and 0.494 on VAST evaluation data; additional post-evaluation experiments on ResNet encoding layers and loss functions are also presented.
Significance. If the reported scores hold, the work supplies a concrete reference point from official blind evaluation data for a competitive multi-system entry in a major speaker recognition benchmark. The enumeration of diverse front-ends and adaptation steps offers practical guidance on handling domain mismatch, though the absence of component-wise ablations limits insight into which elements drove the final costs.
major comments (2)
- [Abstract] Abstract: the central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.
- [Description of the submitted primary system] Description of the submitted primary system: no explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.
minor comments (2)
- The manuscript would benefit from a table or section listing the amount and sources of training data used for each extractor and back-end.
- Post-evaluation ResNet experiments are clearly separated from the submitted system, but a brief statement confirming that none of those changes were retroactively applied to the official scores would improve clarity.
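The second major comment asks how scores were normalized. The paper's conclusion mentions score normalization but the text available here does not say which variant was used, so the following is only a sketch of one common choice, symmetric normalization (S-norm) with an optional adaptive top-N cohort selection. The function names, the `top_n` parameter, and the cohort values are assumptions for illustration.

```python
from statistics import mean, pstdev


def _cohort_stats(cohort_scores, top_n=None):
    """Mean and standard deviation of the cohort, optionally the top-N scores only."""
    kept = sorted(cohort_scores, reverse=True)[:top_n] if top_n else list(cohort_scores)
    return mean(kept), pstdev(kept)


def s_norm(raw_score, enroll_cohort, test_cohort, top_n=None):
    """Symmetric score normalization.

    enroll_cohort: scores of the enrollment model against a cohort set.
    test_cohort:   scores of the test segment against the same cohort set.
    The raw score is z-normalized against each side and the two are averaged.
    Assumes each cohort has non-zero spread.
    """
    mu_e, sd_e = _cohort_stats(enroll_cohort, top_n)
    mu_t, sd_t = _cohort_stats(test_cohort, top_n)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


# Hypothetical cohort scores.
normalized = s_norm(5.0, enroll_cohort=[0.0, 2.0], test_cohort=[1.0, 5.0])
print(normalized)
```

Restricting the statistics to the highest-scoring cohort entries (the `top_n` argument) is the adaptive variant, which tends to matter when the cohort mixes several domains.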
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: the central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.
  Authors: The reported detection costs are the official single-run results from the NIST 2018 SRE blind evaluation and therefore do not include variability across repeated trials or partitions. We will revise the abstract to state the number of trials (obtainable from the NIST evaluation plan) for each condition and to clarify that the figures are official evaluation scores rather than experimental averages.
  Revision: partial
- Referee: [Description of the submitted primary system] Description of the submitted primary system: no explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.
  Authors: We agree that the fusion details require greater explicitness. In the revised manuscript we will add a dedicated paragraph in the primary-system section specifying the score normalization procedure, the fusion method (linear logistic-regression fusion), and the combination weights applied to the four front-end systems.
  Revision: yes
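The simulated rebuttal names linear logistic-regression fusion. As a rough illustration of that technique, the sketch below learns one weight per subsystem plus an offset by gradient descent on the logistic loss, then fuses scores as a weighted sum. This is not the authors' implementation or the BOSARIS code; the function names, learning rate, epoch count, and the two-system development scores are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(scores, weights, offset):
    """Fused score: offset plus the weighted sum of subsystem scores."""
    return offset + sum(w * s for w, s in zip(weights, scores))


def train_fusion(dev_scores, dev_labels, lr=0.1, epochs=2000):
    """Fit fusion weights by batch gradient descent on the logistic loss.

    dev_scores: one list of subsystem scores per trial.
    dev_labels: 1 for target trials, 0 for non-target trials.
    """
    n_sys = len(dev_scores[0])
    n = len(dev_scores)
    weights, offset = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for x, y in zip(dev_scores, dev_labels):
            err = sigmoid(fuse(x, weights, offset)) - y
            for i in range(n_sys):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        offset -= lr * grad_b / n
    return weights, offset


# Hypothetical development scores from two subsystems.
dev_scores = [[2.0, 1.5], [1.0, 2.5], [1.8, 0.9],
              [-1.0, -2.0], [-2.2, -0.5], [-0.7, -1.6]]
dev_labels = [1, 1, 1, 0, 0, 0]

weights, offset = train_fusion(dev_scores, dev_labels)
fused = [fuse(x, weights, offset) for x in dev_scores]
print(weights, offset)
```

In practice the weights would be trained on a held-out development set, and a prior-weighted objective is often used so that the fused scores are calibrated for the evaluation's operating point.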
Circularity Check
No significant circularity; empirical evaluation scores on blind partitions
Full rationale
The paper is a system-description submission to NIST SRE 2018. Its central claim consists of factual detection-cost numbers (0.392 / 0.494) obtained by the submitted primary system on the official blind CMN2 and VAST partitions. These numbers are fixed by the evaluation protocol itself; the paper enumerates front-end extractors and back-end steps but presents no derivation, fitted equation, or modeling assumption whose output is then re-used as a “prediction.” No self-citation chain, ansatz, or uniqueness theorem is invoked to justify the reported figures. Consequently the result cannot reduce to its own inputs by construction and receives the default non-circularity score.
Reference graph
Works this paper leans on
- [1] Introduction. Since 1996, the US National Institute of Standards and Technology (NIST) has been conducting speaker recognition evaluations (SRE) to explore promising new ideas and measure the performance of state-of-the-art speaker recognition systems [1]. NIST SRE 2018 focuses on text-independent speaker verification (TISV) and contains two testing t...
- [2] System descriptions. 2.1. Front-end extractor. 2.1.1. MFCC i-vector. The MFCC i-vector system is developed by adapting the Kaldi SRE16 recipe. 20-dimensional MFCCs are augmented with their delta and double-delta coefficients, making 60-dimensional feature vectors. A simple energy-based voice activity detector (VAD) is used. A short-time cepstral mean subtrac...
- [3] Submitted system performance. 3.1. Data preparation. The training data includes SRE04-16, MIXER 6, Switchboard, VoxCeleb1 [21] and VoxCeleb2 [22], resulting in 14,467 speakers altogether. Data augmentation is utilized for both x-vector and ResNet systems. We adopt the same data augmentation strategy as the Kaldi x-vector recipe. It employs additive noises an...
- [4] Post evaluation. After the evaluation, we further extend our experiments by investigating more encoding layer designs and loss functions for our deep ResNet system. First, following the x-vector system that accumulates mean and standard deviation statistics in its pooling layer, we boost the GAP layer by adding the global standard deviation statistics of ...
- [5] Conclusions. In this paper, our submitted DKU-SMIIP system for the NIST SRE 2018 is described. Various kinds of state-of-the-art speaker embedding extractors are explored. We also utilize variability compensation, domain adaptation, in-domain whitening, and score normalization algorithms to reduce the mismatch between training and testin...
- [6] National Institute of Standards and Technology, “NIST 2018 Speaker Recognition Evaluation Plan,” 2018. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf
- [7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
- [8] M. Li and W. Liu, “Speaker Verification and Spoken Language Identification using a Generalized i-vector Framework with Phonetic Tokenizations and Tandem Features,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 1120–1124.
- [9] M. Li, L. Liu, W. Cai, and W. Liu, “Generalized i-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification,” Journal of Signal Processing Systems, vol. 82, no. 2, pp. 207–215, 2016.
- [10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
- [11] W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 74–81.
- [12] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5189–5193.
- [13] J. Chen, W. Cai, D. Cai, Z. Cai, H. Zhong, and M. Li, “End-to-end Language Identification using NetFV and NetVLAD,” in The 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018.
- [14] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A Discriminative Feature Learning Approach for Deep Face Recognition,” in Proceedings of the 14th European Conference on Computer Vision (ECCV), 2016, pp. 499–515.
- [15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 212–220.
- [16] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213.
- [17] W. Cai, J. Chen, and M. Li, “Analysis of Length Normalization in End-to-End Speaker Verification System,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.
- [18] D. Cai, X. He, K. Zhou, J. Han, and H. Bao, “Locality Sensitive Discriminant Analysis,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1713–1726.
- [19] D. Cai, W. Cai, Z. Ni, and M. Li, “Locality Sensitive Discriminant Analysis for Speaker Verification,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.
- [20] B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 2058–2065.
- [21] M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018.
- [22] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 818–822.
- [23] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2011, pp. 249–252.
- [24] P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.
- [25] N. Brümmer and E. De Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013.
- [26] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616–2620.
- [27] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.
- [28] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.
- [29] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015.
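The post-evaluation excerpt in entry [4] describes extending the ResNet's global average pooling (GAP) layer with global standard deviation statistics, in the manner of the x-vector pooling layer. A minimal sketch of that idea follows, assuming the feature map has already been flattened into a list of per-position channel vectors. The function name and the toy values are hypothetical, and the population standard deviation used here is one possible choice that the excerpt does not specify.

```python
import math


def statistics_pooling(positions):
    """Pool a variable-length feature map into a fixed-length vector.

    positions: list of per-position feature vectors, each with C channels.
    Returns a vector of length 2 * C: channel means followed by channel
    standard deviations. Plain GAP would return only the first C values.
    """
    count = len(positions)
    channels = len(positions[0])
    means = [sum(p[c] for p in positions) / count for c in range(channels)]
    stds = [
        math.sqrt(sum((p[c] - means[c]) ** 2 for p in positions) / count)
        for c in range(channels)
    ]
    return means + stds


# Hypothetical feature map: two positions, two channels.
pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
print(pooled)
```

The output length depends only on the channel count, so utterances of different durations map to embeddings of the same size.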