The DKU-SMIIP System for NIST 2018 Speaker Recognition Evaluation

Danwei Cai; Ming Li; Weicheng Cai

arxiv: 1907.02191 · v1 · pith:WIUZL3K5new · submitted 2019-07-04 · 📡 eess.AS

Pith reviewed 2026-05-25 09:09 UTC · model grok-4.3

classification 📡 eess.AS

keywords speaker verification · speaker recognition · i-vector · x-vector · ResNet embedding · domain adaptation · detection cost · front-end extractor

The pith

A multi-extractor speaker verification system achieves actual detection costs of 0.392 on the CMN2 evaluation data and 0.494 on the VAST evaluation data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the development of a text-independent speaker verification system that integrates several advanced front-end methods for extracting speaker embeddings. These include MFCC i-vectors, DNN tandem i-vectors, TDNN x-vectors, and deep ResNet embeddings. Back-end techniques are applied to compensate for variability and adapt to domain differences between training and test conditions. The reported performance on the fixed condition of the evaluation demonstrates the effectiveness of this combined approach for handling challenging mismatch scenarios.

Core claim

The submitted system, which employs multiple state-of-the-art front-end extractors including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet, along with back-end modeling for variability compensation and domain adaptation, obtains an actual detection cost of 0.392 on CMN2 evaluation data and 0.494 on VAST evaluation data under the fixed condition. Further post-evaluation experiments investigate different encoding layer designs and loss functions for the deep ResNet component.
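For readers unfamiliar with the metric, the following is a minimal sketch of how an actual detection cost can be computed from log-likelihood-ratio scores. It assumes equal miss and false-alarm costs and a caller-supplied target prior; the function name, the example scores, and the prior of 0.01 are hypothetical, and the official operating points are defined in the NIST evaluation plan rather than on this page.

```python
import math

def actual_dcf(target_scores, nontarget_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost at the Bayes threshold for LLR scores.

    Assumption: the cost is normalized by c_miss * p_target, the usual
    default cost when the target prior is small.
    """
    beta = (c_fa / c_miss) * (1.0 - p_target) / p_target
    threshold = math.log(beta)  # Bayes decision threshold for LLR scores
    # Miss: a target trial scored below the threshold.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # False alarm: a non-target trial scored at or above the threshold.
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss + beta * p_fa

# Hypothetical scores: one of four targets falls below the threshold,
# and no non-target reaches it, so the cost is 0.25.
cost = actual_dcf([10.0, 8.0, 2.0, 6.0], [-10.0, -8.0, -6.0, -1.0], p_target=0.01)
```

Because the cost is evaluated at a fixed threshold, it penalizes poorly calibrated scores as well as poor discrimination, which is why it is called "actual" rather than "minimum".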

What carries the argument

The pipeline of multiple front-end speaker embedding extractors followed by back-end variability compensation and domain adaptation.

If this is right

Combining different embedding extractors improves performance in mismatched conditions.
Domain adaptation in the back-end is key to handling training-test differences.
Post-evaluation analysis of ResNet designs can lead to further refinements in embedding quality.
The system extends the use of tandem and deep neural network based extractors for speaker tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such combinations may generalize to other audio classification tasks with domain shifts.
Real-time applications could benefit from optimized versions of these multi-extractor systems.
Comparing these costs to single-extractor baselines would clarify the contribution of each component.

Load-bearing premise

The described combination of front-end extractors and back-end steps directly produces the reported detection costs on the specified evaluation data without additional undisclosed modifications.

What would settle it

Independent reproduction of the system on the CMN2 and VAST data partitions yielding different detection costs would indicate that the reported figures do not hold under the stated conditions.

Original abstract

In this paper, we present the system submission for the NIST 2018 Speaker Recognition Evaluation by DKU Speech and Multi-Modal Intelligent Information Processing (SMIIP) Lab. We explore various kinds of state-of-the-art front-end extractors as well as back-end modeling for text-independent speaker verification. Our submitted primary systems employ multiple state-of-the-art front-end extractors, including the MFCC i-vector, the DNN tandem i-vector, the TDNN x-vector, and the deep ResNet. After the speaker embedding is extracted, we exploit several kinds of back-end modeling to perform variability compensation and domain adaptation for mismatched training and testing conditions. The final submitted system on the fixed condition obtains actual detection costs of 0.392 and 0.494 on the CMN2 and VAST evaluation data, respectively. After the official evaluation, we further extend our experiments by investigating multiple encoding layer designs and loss functions for the deep ResNet system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard competition system paper reporting NIST 2018 scores from known components with no new techniques.

This paper is a report of the DKU team's entry in the NIST 2018 Speaker Recognition Evaluation. Their primary fixed-condition system reached detection costs of 0.392 on CMN2 and 0.494 on VAST by fusing four existing front-end extractors with ordinary back-end compensation and adaptation steps. After the evaluation they ran extra checks on ResNet encoding layers and loss functions. The paper states the official scores plainly and lists the components used, which is the expected content for this kind of submission. The numbers themselves are fixed by the blind evaluation protocol, so they stand as direct evidence.

Nothing in the main system is new. The extractors (MFCC i-vector, DNN tandem i-vector, TDNN x-vector, deep ResNet) and the back-end techniques were already published. The post-evaluation ResNet runs are routine hyper-parameter sweeps rather than fresh methods. The abstract gives no error bars, no ablation numbers, and no training-data details, which limits how much a reader can learn about why the fusion worked.

This paper is mainly useful to people who track speaker recognition evaluations and want one concrete recipe from 2018. It will not shift thinking about embeddings or adaptation. A reader hunting for new ideas or reproducible advances will find little to take away. It still deserves peer review because competition system descriptions are standard in the field and the central claims rest on verifiable official scores rather than internal modeling assumptions.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the DKU-SMIIP Lab submission to the NIST 2018 Speaker Recognition Evaluation. It details multiple front-end speaker embedding extractors (MFCC i-vector, DNN tandem i-vector, TDNN x-vector, ResNet) and back-end techniques for variability compensation and domain adaptation under mismatched conditions. The primary fixed-condition system is reported to achieve actual detection costs of 0.392 on CMN2 and 0.494 on VAST evaluation data; additional post-evaluation experiments on ResNet encoding layers and loss functions are also presented.

Significance. If the reported scores hold, the work supplies a concrete reference point from official blind evaluation data for a competitive multi-system entry in a major speaker recognition benchmark. The enumeration of diverse front-ends and adaptation steps offers practical guidance on handling domain mismatch, though the absence of component-wise ablations limits insight into which elements drove the final costs.

major comments (2)

[Abstract] The central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.
[Description of the submitted primary system] No explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.

minor comments (2)

The manuscript would benefit from a table or section listing the amount and sources of training data used for each extractor and back-end.
Post-evaluation ResNet experiments are clearly separated from the submitted system, but a brief statement confirming that none of those changes were retroactively applied to the official scores would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.

Referee: [Abstract] The central performance claims (detection costs 0.392 / 0.494) are stated without error bars, trial counts, or any indication of variability across the evaluation partitions, which directly affects assessment of the reliability of the reported figures.

Authors: The reported detection costs are the official single-run results from the NIST 2018 SRE blind evaluation and therefore do not include variability across repeated trials or partitions. We will revise the abstract to state the number of trials (obtainable from the NIST evaluation plan) for each condition and to clarify that the figures are official evaluation scores rather than experimental averages.
revision: partial

Referee: [Description of the submitted primary system] No explicit account is given of the fusion strategy, score normalization, or weighting among the four front-end extractors, which is load-bearing for understanding how the stated detection costs were obtained from the listed components.

Authors: We agree that the fusion details require greater explicitness. In the revised manuscript we will add a dedicated paragraph in the primary-system section specifying the score normalization procedure, the fusion method (linear logistic-regression fusion), and the combination weights applied to the four front-end systems.
revision: yes
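The response above names linear logistic-regression fusion. As a rough illustration of what such a fusion computes, and not the authors' actual configuration, the sketch below fits one weight per subsystem plus a bias by minimizing cross-entropy with plain gradient descent. The function names, learning rate, epoch count, and the four-system toy scores are all hypothetical.

```python
import math

def train_fusion(scores, labels, lr=0.1, epochs=500):
    """Fit fused = w . s + b by gradient descent on the cross-entropy loss.

    scores: list of per-trial score vectors, one score per subsystem.
    labels: 1 for target trials, 0 for non-target trials.
    """
    n_sys = len(scores[0])
    n = len(scores)
    w = [0.0] * n_sys
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_sys
        grad_b = 0.0
        for s, y in zip(scores, labels):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # posterior of "target"
            err = p - y
            for i in range(n_sys):
                grad_w[i] += err * s[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fuse(w, b, s):
    """Apply the learned linear fusion to one trial's subsystem scores."""
    return sum(wi * si for wi, si in zip(w, s)) + b

# Hypothetical development scores from four subsystems.
dev_scores = [[2.0, 1.5, 2.5, 1.0], [1.0, 2.0, 1.5, 2.2],
              [-1.5, -2.0, -1.0, -2.5], [-2.0, -1.0, -2.2, -1.4]]
dev_labels = [1, 1, 0, 0]
weights, bias = train_fusion(dev_scores, dev_labels)
```

In practice the weights would be trained on a held-out development set, and the fused output is often treated as a calibrated log-likelihood ratio after accounting for the training prior; that step is omitted here.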

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation scores on blind partitions

The paper is a system-description submission to NIST SRE 2018. Its central claim consists of factual detection-cost numbers (0.392 / 0.494) obtained by the submitted primary system on the official blind CMN2 and VAST partitions. These numbers are fixed by the evaluation protocol itself; the paper enumerates front-end extractors and back-end steps but presents no derivation, fitted equation, or modeling assumption whose output is then re-used as a “prediction.” No self-citation chain, ansatz, or uniqueness theorem is invoked to justify the reported figures. Consequently the result cannot reduce to its own inputs by construction and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems report; it introduces no new mathematical axioms, free parameters fitted inside the paper, or invented physical entities.

pith-pipeline@v0.9.0 · 5697 in / 1105 out tokens · 21118 ms · 2026-05-25T09:09:50.641401+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1]

Introduction. Since 1996, the US National Institute of Standards and Technology (NIST) has been conducting speaker recognition evaluations (SRE) to explore promising new ideas and measure the performance of state-of-the-art speaker recognition systems [1]. NIST SRE 2018 focuses on text-independent speaker verification (TISV) and contains two testing t...

[2]

System descriptions. 2.1. Front-end extractor. 2.1.1. MFCC i-vector. The MFCC i-vector system is developed by adapting the Kaldi SRE16 recipe. The 20-dimensional MFCCs are augmented with their delta and double-delta coefficients, making 60-dimensional feature vectors. A simple energy-based voice activity detector (VAD) is used. A short-time cepstral mean subtrac...

[3]

Submitted system performance. 3.1. Data preparation. The training data includes SRE04-16, MIXER 6, Switchboard, VoxCeleb1 [21] and VoxCeleb2 [22], resulting in 14,467 speakers altogether. Data augmentation is utilized for both x-vector and ResNet systems. We adopt the same data augmentation strategy as the Kaldi x-vector recipe. It employs additive noises an...

[4]

Post evaluation. After the evaluation, we further extend our experiments by investigating more encoding layer designs and loss functions for our deep ResNet system. First, following the x-vector system that accumulates mean and standard deviation statistics in the pooling layer, we boost the GAP layer by adding the global standard deviation statistics of ...

[5]

Conclusions. In this paper, our submitted DKU-SMIIP system for the NIST SRE 2018 is described. Various kinds of state-of-the-art speaker embedding extractors are explored. We also utilize variability compensation, domain adaptation, in-domain whitening, and score normalization algorithms to reduce the mismatch condition between training and testin...

[6]

National Institute of Standards and Technology, “NIST 2018 Speaker Recognition Evaluation Plan,” 2018. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf

[7]

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[8]

M. Li and W. Liu, “Speaker Verification and Spoken Language Identification using a Generalized i-vector Framework with Phonetic Tokenizations and Tandem Features,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 1120–1124.

[9]

M. Li, L. Liu, W. Cai, and W. Liu, “Generalized i-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification,” Journal of Signal Processing Systems, vol. 82, no. 2, pp. 207–215, 2016.

[10]

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.

[11]

W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 74–81.

[12]

W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5189–5193.

[13]

J. Chen, W. Cai, D. Cai, Z. Cai, H. Zhong, and M. Li, “End-to-end Language Identification using NetFV and NetVLAD,” in The 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018.

[14]

Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A Discriminative Feature Learning Approach for Deep Face Recognition,” in Proceedings of the 14th European Conference on Computer Vision (ECCV), 2016, pp. 499–515.

[15]

W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 212–220.

[16]

W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213.

[17]

W. Cai, J. Chen, and M. Li, “Analysis of Length Normalization in End-to-End Speaker Verification System,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.

[18]

D. Cai, X. He, K. Zhou, J. Han, and H. Bao, “Locality Sensitive Discriminant Analysis,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1713–1726.

[19]

D. Cai, W. Cai, Z. Ni, and M. Li, “Locality Sensitive Discriminant Analysis for Speaker Verification,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.

[20]

B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 2058–2065.

[21]

M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey 2018: The Speaker and Language Recognition Workshop, 2018.

[22]

M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild (SITW) Speaker Recognition Database,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 818–822.

[23]

D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2011, pp. 249–252.

[24]

P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.

[25]

N. Brümmer and E. de Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013.

[26]

A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616–2620.

[27]

J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.

[28]

T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.

[29]

D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015.
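
The section excerpts quoted in entries [2] to [4] of the reference graph describe three front-end operations: appending delta and double-delta coefficients to 20-dimensional MFCCs, augmenting training audio with additive noise, and pooling frame-level features with mean and standard deviation statistics. The following is a minimal pure-Python sketch of each, not the authors' or Kaldi's implementation. The regression window of 2, the edge replication, the SNR argument, and all function names are assumptions made for illustration, and the pooling function simplifies a ResNet feature map to a sequence of per-frame channel vectors.

```python
import math

def deltas(frames, window=2):
    """Regression-based deltas over time; edge frames are replicated (assumed)."""
    n_frames, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, n_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def add_deltas(frames, window=2):
    """Concatenate static, delta, and double-delta features (20 -> 60 dims)."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to a target SNR in dB.

    Assumes the noise has non-zero power; it is tiled if shorter than speech.
    """
    if len(noise) < len(speech):
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def mean_std_pooling(frames):
    """Pool T frame vectors of C channels into one 2*C vector: means, then stds."""
    t, c = len(frames), len(frames[0])
    means = [sum(f[j] for f in frames) / t for j in range(c)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / t)
            for j in range(c)]
    return means + stds
```

Mean-only pooling discards how much each channel varies over an utterance; appending the standard deviation keeps that second-order information at the cost of doubling the pooled dimension.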
