arxiv: 1907.02191 · v1 · pith:WIUZL3K5new · submitted 2019-07-04 · 📡 eess.AS

The DKU-SMIIP System for NIST 2018 Speaker Recognition Evaluation

Pith reviewed 2026-05-25 09:09 UTC · model grok-4.3

classification 📡 eess.AS
keywords speaker verification · speaker recognition · i-vector · x-vector · ResNet embedding · domain adaptation · detection cost · front-end extractor

The pith

A multi-extractor speaker verification system achieves detection costs of 0.392 and 0.494 on two evaluation sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the development of a text-independent speaker verification system that integrates several advanced front-end methods for extracting speaker embeddings. These include MFCC i-vectors, DNN tandem i-vectors, TDNN x-vectors, and deep ResNet embeddings. Back-end techniques are applied to compensate for variability and adapt to domain differences between training and test conditions. The reported performance on the fixed condition of the evaluation demonstrates the effectiveness of this combined approach for handling challenging mismatch scenarios.

Core claim

The submitted system, which employs multiple state-of-the-art front-end extractors including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet, along with back-end modeling for variability compensation and domain adaptation, obtains an actual detection cost of 0.392 on CMN2 evaluation data and 0.494 on VAST evaluation data under the fixed condition. Further post-evaluation experiments investigate different encoding layer designs and loss functions for the deep ResNet component.
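For readers unfamiliar with the metric, the following is a minimal sketch of how a normalized actual detection cost can be computed from log-likelihood-ratio scores. It assumes the common convention of thresholding at the Bayes decision point and normalizing by the cost of always rejecting. The function name, default costs, and the `p_target` value in the usage note are illustrative assumptions; the official operating points are defined in the NIST evaluation plan and are not reproduced here.

```python
import math


def detection_cost(target_scores, nontarget_scores, p_target,
                   c_miss=1.0, c_fa=1.0):
    """Normalized actual detection cost for log-likelihood-ratio scores.

    Assumption: decisions are made at the Bayes threshold log(beta), and the
    cost is normalized by c_miss * p_target (the cost of rejecting every trial).
    """
    beta = (c_fa * (1.0 - p_target)) / (c_miss * p_target)
    threshold = math.log(beta)
    # Miss: a target trial scored below the threshold.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # False alarm: a non-target trial scored at or above the threshold.
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss + beta * p_fa
```

With an illustrative `p_target` of 0.01, a system that misses one of four targets and falsely accepts one of four non-targets would be charged heavily for the false alarm, because `beta` is roughly 99 at that operating point. This is why a value below 1.0, such as the reported 0.392 and 0.494, indicates a system that does better than rejecting everything.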

What carries the argument

The pipeline of multiple front-end speaker embedding extractors followed by back-end variability compensation and domain adaptation.
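As one concrete piece of that pipeline, the front-end excerpt quoted in the reference graph states that 20-dimensional MFCCs are augmented with delta and double-delta coefficients to give 60-dimensional vectors. The sketch below shows one standard regression-based way to do this. The window size of 2 and the edge-replication padding are assumptions, and the function name `add_deltas` is hypothetical; the Kaldi recipe the authors adapted may differ in detail.

```python
def add_deltas(frames, window=2):
    """Append delta and double-delta coefficients to each feature frame.

    frames: list of T frames, each a list of D floats.
    Returns a list of T frames, each with 3 * D floats.
    Assumption: regression deltas with edge frames replicated at boundaries.
    """
    def deltas(seq):
        n_frames = len(seq)
        denom = 2 * sum(n * n for n in range(1, window + 1))
        out = []
        for t in range(n_frames):
            row = []
            for d in range(len(seq[0])):
                acc = 0.0
                for n in range(1, window + 1):
                    nxt = seq[min(t + n, n_frames - 1)][d]
                    prv = seq[max(t - n, 0)][d]
                    acc += n * (nxt - prv)
                row.append(acc / denom)
            out.append(row)
        return out

    d1 = deltas(frames)   # first-order dynamics
    d2 = deltas(d1)       # second-order dynamics
    return [f + a + b for f, a, b in zip(frames, d1, d2)]
```

Applied to 20-dimensional input frames, this yields 60-dimensional output frames, matching the dimensionality the excerpt reports.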

If this is right

  • Combining different embedding extractors improves performance in mismatched conditions.
  • Domain adaptation in the back-end is key to handling training-test differences.
  • Post-evaluation analysis of ResNet designs can lead to further refinements in embedding quality.
  • The system extends the use of tandem and deep neural network based extractors for speaker tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such combinations may generalize to other audio classification tasks with domain shifts.
  • Real-time applications could benefit from optimized versions of these multi-extractor systems.
  • Comparing these costs to single-extractor baselines would clarify the contribution of each component.

Load-bearing premise

The described combination of front-end extractors and back-end steps directly produces the reported detection costs on the specified evaluation data without additional undisclosed modifications.

What would settle it

Independent reproduction of the system on the CMN2 and VAST data partitions yielding different detection costs would indicate that the reported figures do not hold under the stated conditions.

Original abstract

In this paper, we present the system submission for the NIST 2018 Speaker Recognition Evaluation by DKU Speech and Multi-Modal Intelligent Information Processing (SMIIP) Lab. We explore various kinds of state-of-the-art front-end extractors as well as back-end modeling for text-independent speaker verifications. Our submitted primary systems employ multiple state-of-the-art front-end extractors, including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet. After speaker embedding is extracted, we exploit several kinds of back-end modeling to perform variability compensation and domain adaptation for mismatch training and testing conditions. The final submitted system on the fixed condition obtains actual detection cost of 0.392 and 0.494 on CMN2 and VAST evaluation data respectively. After the official evaluation, we further extend our experiments by investigating multiple encoding layer designs and loss functions for the deep ResNet system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the DKU-SMIIP Lab submission to the NIST 2018 Speaker Recognition Evaluation. It details multiple front-end speaker embedding extractors (MFCC i-vector, DNN tandem i-vector, TDNN x-vector, ResNet) and back-end techniques for variability compensation and domain adaptation under mismatched conditions. The primary fixed-condition system is reported to achieve actual detection costs of 0.392 on CMN2 and 0.494 on VAST evaluation data; additional post-evaluation experiments on ResNet encoding layers and loss functions are also presented.

Significance. If the reported scores hold, the work supplies a concrete reference point from official blind evaluation data for a competitive multi-system entry in a major speaker recognition benchmark. The enumeration of diverse front-ends and adaptation steps offers practical guidance on handling domain mismatch, though the absence of component-wise ablations limits insight into which elements drove the final costs.

major comments (2)
  1. [Abstract] The central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.
  2. [Description of the submitted primary system] No explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.
minor comments (2)
  1. The manuscript would benefit from a table or section listing the amount and sources of training data used for each extractor and back-end.
  2. Post-evaluation ResNet experiments are clearly separated from the submitted system, but a brief statement confirming that none of those changes were retroactively applied to the official scores would improve clarity.
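
To make the score normalization question in the second major comment concrete, here is a minimal sketch of symmetric normalization (S-norm), one widely used scheme. The material shown on this page does not say which variant the authors used, so this is an illustrative assumption rather than their method; the cohort lists and the function name are hypothetical.

```python
from statistics import mean, pstdev


def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (S-norm) for one trial.

    enroll_cohort_scores: scores of the enrollment model against a cohort.
    test_cohort_scores: scores of the test utterance against the same cohort.
    Both cohorts are assumed to be non-constant so the deviations are non-zero.
    """
    z = (raw_score - mean(enroll_cohort_scores)) / pstdev(enroll_cohort_scores)
    t = (raw_score - mean(test_cohort_scores)) / pstdev(test_cohort_scores)
    return 0.5 * (z + t)
```

Which cohort is used, and whether only the top-scoring cohort entries are kept (the adaptive variant), can shift the actual detection cost noticeably, which is why the referee asks for it to be stated.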

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.

  1. Referee: [Abstract] The central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.

    Authors: The reported detection costs are the official single-run results from the NIST 2018 SRE blind evaluation and therefore do not include variability across repeated trials or partitions. We will revise the abstract to state the number of trials (obtainable from the NIST evaluation plan) for each condition and to clarify that the figures are official evaluation scores rather than experimental averages. revision: partial

  2. Referee: [Description of the submitted primary system] No explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.

    Authors: We agree that the fusion details require greater explicitness. In the revised manuscript we will add a dedicated paragraph in the primary-system section specifying the score normalization procedure, the fusion method (linear logistic-regression fusion), and the combination weights applied to the four front-end systems. revision: yes
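
The linear logistic-regression fusion named in this simulated response can be sketched as follows. This is a bare-bones illustration under stated assumptions: plain batch gradient descent, no prior weighting, and hypothetical function names, learning rate, and epoch count. It is not the authors' configuration, which the page does not give.

```python
import math


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit weights w and offset b so sigmoid(w . s + b) predicts the label.

    scores: list of trials, each a list with one score per subsystem.
    labels: 1 for target trials, 0 for non-target trials.
    """
    n_sys = len(scores[0])
    n = len(scores)
    w = [0.0] * n_sys
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_sys
        grad_b = 0.0
        for s, y in zip(scores, labels):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for k in range(n_sys):
                grad_w[k] += err * s[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fuse(w, b, trial_scores):
    """Fused score for one trial: a weighted sum of subsystem scores."""
    return sum(wi * si for wi, si in zip(w, trial_scores)) + b
```

Here each trial would carry four scores, one per front-end (MFCC i-vector, DNN tandem i-vector, TDNN x-vector, ResNet). The weights would be trained on development trials with known labels and then applied unchanged to evaluation trials.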

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation scores on blind partitions

The paper is a system-description submission to NIST SRE 2018. Its central claim consists of factual detection-cost numbers (0.392 / 0.494) obtained by the submitted primary system on the official blind CMN2 and VAST partitions. These numbers are fixed by the evaluation protocol itself; the paper enumerates front-end extractors and back-end steps but presents no derivation, fitted equation, or modeling assumption whose output is then re-used as a “prediction.” No self-citation chain, ansatz, or uniqueness theorem is invoked to justify the reported figures. Consequently the result cannot reduce to its own inputs by construction and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems report; it introduces no new mathematical axioms, free parameters fitted inside the paper, or invented physical entities.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1]

    Introduction Since 1996, the US National Institute of Standards and Technology (NIST) has been conducting speaker recognition evaluations (SRE) to explore promising new ideas and measure the performance the state-of-the-art speaker recognition systems [1]. NIST SRE 2018 focus on text-independent speaker verification (TISV) and contains two testing t...

  2. [2]

    Front-end extractor 2.1.1

    System descriptions 2.1. Front-end extractor 2.1.1. MFCC i-vector The MFCC i-vector system is developed by adapting the Kaldi SRE16 recipe. 20-dimensional MFCC is augmented with their delta and double delta coefficients, making 60-dimensional feature vectors. A simple energy-based voice activity detector (VAD) is used. A short-time cepstral mean subtrac...

  3. [3]

    Data preparation The training data includes SRE04-16, MIXER 6, Switchboard, VoxCeleb1 [21] and VoxCeleb2 [22], resulting 14,467 speakers altogether

    Submitted system performance 3.1. Data preparation The training data includes SRE04-16, MIXER 6, Switchboard, VoxCeleb1 [21] and VoxCeleb2 [22], resulting 14,467 speakers altogether. Data augmentation is utilized for both x-vector and ResNet systems. We adopt the same data augmentation strategy as the Kaldi x-vector recipe. It employs additive noises an...

  4. [4]

    Post evaluation After the evaluation, we further extend our experiments by investigating more encoding layer designs and loss functions for our deep ResNet system. First, following the x-vector system that accumulates mean and standard deviation statistics in pooling layer, we boost the GAP layer by adding the global standard deviation statistics of ...

  5. [5]

    Various kinds of state-of-the-art speaker embedding extractors are explored

    Conclusions In this paper, our submitted DKU-SMIIP system for the NIST SRE 2018 is described. Various kinds of state-of-the-art speaker embedding extractors are explored. We also utilize variabilities compensation, domain adaptation, in-domain whitening, and score normalization algorithms to reduce the mismatch condition between training and testin...

  6. [6]

    NIST 2018 Speaker Recognition Evaluation Plan,

    National Institute of Standards and Technology, “NIST 2018 Speaker Recognition Evaluation Plan,” 2018. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf

  7. [7]

    Front-End Factor Analysis for Speaker Verification,

    N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

  8. [8]

    Speaker Verification and Spoken Language Identification using a Generalized i-vector Framework with Phonetic Tokenizations and Tandem Features,

    M. Li and W. Liu, “Speaker Verification and Spoken Language Identification using a Generalized i-vector Framework with Phonetic Tokenizations and Tandem Features,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 1120–1124

  9. [9]

    Generalized i-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification,

    M. Li, L. Liu, W. Cai, and W. Liu, “Generalized i-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification,” Journal of Signal Processing Systems, vol. 82, no. 2, pp. 207–215, 2016

  10. [10]

    X-Vectors: Robust DNN Embeddings for Speaker Recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333

  11. [11]

    Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,

    W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 74–81

  12. [12]

    A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,

    W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5189–5193

  13. [13]

    End-to-end Language Identification using NetFV and NetVLAD,

    J. Chen, W. Cai, D. Cai, Z. Cai, H. Zhong, and M. Li, “End-to-end Language Identification using NetFV and NetVLAD,” in The 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018

  14. [14]

    A Discriminative Feature Learning Approach for Deep Face Recognition,

    Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A Discriminative Feature Learning Approach for Deep Face Recognition,” in Proceedings of the 14th European Conference on Computer Vision (ECCV), 2016, pp. 499–515

  15. [15]

    Sphereface: Deep Hypersphere Embedding for Face Recognition,

    W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 212–220

  16. [16]

    Insights into End-to-End Learning Scheme for Language Identification,

    W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213

  17. [17]

    Analysis of Length Normalization in End-to-End Speaker Verification System,

    W. Cai, J. Chen, and M. Li, “Analysis of Length Normalization in End-to-End Speaker Verification System,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018

  18. [18]

    Locality Sensitive Discriminant Analysis,

    D. Cai, X. He, K. Zhou, J. Han, and H. Bao, “Locality Sensitive Discriminant Analysis,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1713–1726

  19. [19]

    Locality Sensitive Discriminant Analysis for Speaker Verification,

    D. Cai, W. Cai, Z. Ni, and M. Li, “Locality Sensitive Discriminant Analysis for Speaker Verification,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016

  20. [20]

    Return of Frustratingly Easy Domain Adaptation,

    B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 2058–2065

  21. [21]

    Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,

    M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018

  22. [22]

    The Speakers in the Wild (SITW) Speaker Recognition Database,

    M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 818–822

  23. [23]

    Analysis of i-vector Length Normalization in Speaker Recognition Systems,

    D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2011, pp. 249–252

  24. [24]

    Analysis of Score Normalization in Multilingual Speaker Recognition,

    P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017

  25. [25]

    The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF

    N. Brümmer and E. De Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013

  26. [26]

    Voxceleb: A Large-Scale Speaker Identification Dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616–2620

  27. [27]

    Voxceleb2: Deep Speaker Recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018

  28. [28]

    A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,

    T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224

  29. [29]

    MUSAN: A Music, Speech, and Noise Corpus

    D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015