The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge

Danwei Cai; Ming Li; Weicheng Cai; Xiaoyi Qin

The DKU system reaches 0.3532 minDCF and 4.96% EER on the 2019 VOiCES far-field speaker verification evaluation set.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 09:06 UTC pith:527U7R2A

load-bearing objection: This is a standard challenge system paper that reports competitive VOiCES 2019 numbers with routine methods and little new analysis or detail.

arxiv 1907.02194 v1 pith:527U7R2A submitted 2019-07-04 eess.AS

classification eess.AS

keywords speaker recognition · far-field verification · residual neural network · angular softmax loss · weighted prediction error · VOiCES challenge · minDCF · equal error rate

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper lays out a full pipeline for far-field speaker recognition: data pre-processing, short-term spectral features, utterance-level speaker modeling, back-end scoring, and score normalization. The best single system uses a residual neural network trained with angular softmax loss together with weighted prediction error (WPE) processing, and it reaches 0.3668 minDCF and 5.58% EER with cosine scoring; the submitted primary system reaches 0.3532 minDCF and 4.96% EER. A reader would care because far-field conditions introduce distortions that standard close-talk systems do not handle well, so concrete numbers on a shared benchmark show how far the pipeline goes toward overcoming them.

Core claim

The submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set. The best single system employs a residual neural network trained with angular softmax loss; weighted prediction error algorithms further improve performance, and the system reaches 0.3668 minDCF and 5.58% EER with simple cosine similarity scoring before final normalization steps are added.
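For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how EER and minDCF are commonly computed from a list of trial scores. The function names are hypothetical, and the cost parameters (`p_target=0.01`, `c_miss=c_fa=1`) are assumptions chosen for illustration; this page does not state the operating point used by the challenge.

```python
def error_curve(target_scores, nontarget_scores):
    """Sweep a threshold upward; return (p_miss, p_fa) at each distinct score."""
    trials = sorted(
        [(s, True) for s in target_scores] + [(s, False) for s in nontarget_scores],
        key=lambda t: t[0],
    )
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    points = [(0.0, 1.0)]  # threshold below every score: all trials accepted
    rejected_tar = rejected_non = 0
    i = 0
    while i < len(trials):
        score = trials[i][0]
        # Reject every trial that shares this score (handles ties together).
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1]:
                rejected_tar += 1
            else:
                rejected_non += 1
            i += 1
        points.append((rejected_tar / n_tar, (n_non - rejected_non) / n_non))
    return points


def equal_error_rate(points):
    """Approximate EER: the point where miss and false-alarm rates are closest."""
    p_miss, p_fa = min(points, key=lambda p: abs(p[0] - p[1]))
    return (p_miss + p_fa) / 2.0


def min_dcf(points, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost (cost parameters are assumptions)."""
    best = min(
        c_miss * p_target * pm + c_fa * (1.0 - p_target) * pf for pm, pf in points
    )
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Both metrics are read off the same miss/false-alarm curve: EER picks the point where the two error rates balance, while minDCF picks the threshold that minimizes a weighted cost, so it penalizes false alarms much more heavily when the assumed target prior is small.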

What carries the argument

Residual neural network trained with angular softmax loss for utterance-level speaker modeling, augmented by weighted prediction error signal processing and cosine scoring with normalization.
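The cosine scoring step mentioned above can be sketched in a few lines. This is an illustrative sketch only: `cosine_score` and its argument names are hypothetical, and the embeddings are assumed to be plain lists of floats produced by some speaker-embedding network.

```python
import math


def cosine_score(enroll_embedding, test_embedding):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    dot = sum(a * b for a, b in zip(enroll_embedding, test_embedding))
    norm_e = math.sqrt(sum(a * a for a in enroll_embedding))
    norm_t = math.sqrt(sum(b * b for b in test_embedding))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_e * norm_t)
```

Because the score depends only on the angle between the two vectors, it is insensitive to embedding length, which is one reason it pairs naturally with an angular training loss.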

Load-bearing premise

The challenge evaluation set provides a representative test of far-field speaker verification performance without hidden domain shifts or selection effects that would make the reported numbers unrepresentative.

What would settle it

An independent team re-running the exact pipeline on the same evaluation set and obtaining error rates materially different from 0.3532 minDCF and 4.96% EER would show the reported figures do not hold.

If this is right

The full pipeline of pre-processing, residual network embeddings, weighted prediction error, and normalization reduces both minDCF and EER relative to simpler baselines.
Angular softmax training produces embeddings that support effective cosine scoring in far-field conditions.
Adding weighted prediction error yields measurable gains beyond the neural network alone.
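As background for the angular softmax item above, here is a small sketch of the margin function used by the angular softmax (A-softmax) loss as defined in the SphereFace paper listed in the references. The margin value `m=4` is an assumption taken from that paper's common setting; this page does not state which margin the DKU system used, and `a_softmax_psi` is a hypothetical name.

```python
import math


def a_softmax_psi(theta, m=4):
    """A-softmax target-class term: psi(theta) = (-1)^k * cos(m*theta) - 2k,
    for theta in [k*pi/m, (k+1)*pi/m], k = 0..m-1.

    It replaces cos(theta) for the true class, is monotonically decreasing
    on [0, pi], and never exceeds cos(theta), which enforces an angular margin.
    """
    if not 0.0 <= theta <= math.pi:
        raise ValueError("theta must lie in [0, pi]")
    k = min(int(m * theta / math.pi), m - 1)  # clamp so theta == pi maps to k = m-1
    return (-1) ** k * math.cos(m * theta) - 2 * k
```

In training, the target-class logit `||x|| * cos(theta)` is replaced by `||x|| * a_softmax_psi(theta)`, so a sample only gets a high target logit when its angle to the class weight is small by a factor related to `m`.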

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular pipeline could be transferred to other distant-audio tasks such as meeting transcription or voice commands in large rooms.
Replacing the residual network with newer architectures might produce further reductions in EER if retrained under the same loss.
Score normalization appears to correct for score distribution shifts caused by varying distances and room acoustics.
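To make the score normalization point concrete, the sketch below shows one common variant, adaptive symmetric normalization (AS-Norm). This is an assumption for illustration: the page only says "score normalization" in the pipeline description, and the cohort size `top_k` and all function names here are hypothetical.

```python
import statistics


def top_k_stats(cohort_scores, top_k):
    """Mean and standard deviation of the top_k highest cohort scores."""
    top = sorted(cohort_scores, reverse=True)[:top_k]
    mean = statistics.mean(top)
    std = max(statistics.pstdev(top), 1e-8)  # guard against zero spread
    return mean, std


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Average of the raw score z-normalized against each side's nearest cohort.

    enroll_cohort_scores: scores of the enrollment embedding against a cohort.
    test_cohort_scores:   scores of the test embedding against the same cohort.
    """
    mu_e, sd_e = top_k_stats(enroll_cohort_scores, top_k)
    mu_t, sd_t = top_k_stats(test_cohort_scores, top_k)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```

Selecting only the closest cohort entries adapts the statistics to each utterance's acoustic neighborhood, which is the mechanism by which this kind of normalization can compensate for score shifts across distances and rooms.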

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

This is a standard challenge system paper that reports competitive VOiCES 2019 numbers with routine methods and little new analysis or detail.

The DKU system paper is a standard challenge system paper: it reports competitive VOiCES 2019 numbers with routine methods and little new analysis or detail. The primary system reaches 0.3532 minDCF and 4.96% EER on the evaluation set using a residual network trained with angular softmax plus WPE dereverberation and cosine scoring after normalization. A single-system variant reaches 0.3668 minDCF and 5.58% EER before any fusion applied for the submission.

The paper walks through the full pipeline from preprocessing to back-end scoring, which is useful for seeing how these established pieces fit together in far-field conditions. The components themselves come from prior work the authors cite, so the contribution is mainly the specific application and the resulting numbers on this challenge set.

The main limitations are the lack of ablation studies, error bars, or training details that would show how sensitive the results are. The evaluation set also gets little discussion of acoustic conditions, selection criteria, or possible domain shifts from the training data, which makes the numbers harder to interpret beyond this particular test.

This paper is mainly for teams actively working on speaker recognition challenges who want a 2019 reference point for what produced decent far-field performance. Readers looking for new methods or deeper experiments will not find much here. I would send it to peer review for a challenge-related venue because the reported scores are concrete and the pipeline description is clear enough to be useful as a benchmark, even with the gaps in supporting analysis.

Referee Report

1 major / 0 minor

Summary. The manuscript presents the DKU system submitted to the speaker recognition task of the 2019 VOiCES from a Distance Challenge. It describes the full pipeline for far-field speaker verification, covering data pre-processing, short-term spectral features, utterance-level modeling via a residual neural network trained with angular softmax loss, back-end scoring with cosine similarity, and score normalization. The best single system with WPE preprocessing achieves 0.3668 minDCF and 5.58% EER; the submitted primary system reaches 0.3532 minDCF and 4.96% EER on the evaluation set.

Significance. If the reported numbers are reproducible and the evaluation set is unbiased, the work supplies a concrete empirical reference point on the VOiCES benchmark, demonstrating that ResNet + angular softmax combined with WPE can deliver competitive far-field performance. Challenge system papers of this type are useful for establishing baselines, but the lack of supporting analysis reduces the strength of any broader claims about generalization.

major comments (1)

[Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.

Simulated Author's Rebuttal

1 response · 2 unresolved

We thank the referee for the detailed review of our manuscript on the DKU system for the 2019 VOiCES challenge. Below we respond point-by-point to the major comment.

Referee: [Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.

Authors: The abstract is intentionally concise, as is conventional for challenge system descriptions. The full pipeline (WPE preprocessing, 64-dim log-Mel features, ResNet34 with angular softmax, cosine scoring, and AS-Norm) is detailed in Sections 2-4 of the manuscript, along with the training data (VoxCeleb + VOiCES dev) and the fact that all results are on the official VOiCES 2019 evaluation set. We agree that a short clause noting the official challenge evaluation protocol would aid interpretation and will add one sentence to the abstract. Error bars are not reported because the challenge provides a single fixed evaluation set with no provision for multiple independent runs. Ablation studies are outside the scope of this system paper, which focuses on the submitted primary system rather than comparative analysis.

Revision: partial

standing simulated objections not resolved

Error bars on the reported minDCF and EER, because the challenge evaluation consists of a single run on a fixed test set.
Evaluation-set selection criteria and complete acoustic-condition statistics, which are determined by the VOiCES organizers and not fully disclosed to participants.

Circularity Check

0 steps flagged

No derivation chain; purely empirical system report

The paper describes a speaker verification pipeline (data preprocessing, ResNet with angular softmax, WPE, cosine scoring) and reports challenge metrics (0.3532 minDCF, 4.96% EER) on the public VOiCES evaluation set. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present that could reduce any claim to its own inputs by construction. The central claims are experimental results on an external benchmark, which are self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied competition-system paper with no mathematical derivations, free parameters fitted inside a model, or newly postulated entities; all components are drawn from prior literature.

pith-pipeline@v0.9.0 · 5665 in / 1075 out tokens · 46855 ms · 2026-05-25T09:06:22.024177+00:00 · methodology

Original abstract

In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for the far-field speaker verification, including data pre-processing, short-term spectral feature representation, utterance-level speaker modeling, back-end scoring, and score normalization. Our best single system employs a residual neural network trained with angular softmax loss. Also, the weighted prediction error algorithms can further improve performance. It achieves 0.3668 minDCF and 5.58% EER on the evaluation set by using a simple cosine similarity scoring. Finally, the submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

[1]

VOiCES from a Distance Challenge 2019

Introduction. In the past decade, the performance of speaker recognition has improved significantly. The i-vector based method [1] and the deep neural network (DNN) based methods [2, 3] have promoted the development of speaker recognition technology in telephone channel and closed talking scenarios. However, speaker recognition under far-field and complex en...

work page 2019
[2]

The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge

System descriptions 2.1. Data pre-processing 2.1.1. Data augmentation We adopt two kinds of data augmentation strategies. The first is the same as the x-vector system available at Kaldi VoxCeleb recipe, which employs additive noises and reverberation. We also use pyroomacoustics [24] to simulate the room acoustic based on RIR generator using Image Source ...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

Data usage The training data includes VoxCeleb 1 [37] and VoxCeleb 2 [38]

Experiments 3.1. Data usage The training data includes VoxCeleb 1 [37] and VoxCeleb 2 [38]. The original distribution of VoxCeleb split each video into multiple short segments. During training, the segments from the same video are concatenated into a single sound wave, which results in 167897 utterances from 7245 speakers. No voice activity detection (...

work page
[4]

We use different acoustic features, different front-end modeling methods, and various back-end scoring methods

Conclusions We presented the components and analyzed the results of the DKU-SMIIP speaker recognition system for the VOiCES from a Distance Challenge 2019. We use different acoustic features, different front-end modeling methods, and various back-end scoring methods. To further improve the performance, we use WPE to dereverberate the development and ev...

work page 2019
[5]

Front-End Factor Analysis for Speaker Verification,

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

work page 2011
[6]

x-vectors: Robust DNN Embeddings for Speaker Recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “x-vectors: Robust DNN Embeddings for Speaker Recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333

work page 2018
[7]

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,

W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 74–81

work page 2018
[8]

Distant Speech Recognition

M. Wölfel and J. McDonough, Distant Speech Recognition. John Wiley & Sons, Incorporated, 2009

work page 2009
[9]

The Perception of Speech Under Adverse Conditions,

P. Assmann and Q. Summerfield, “The Perception of Speech Under Adverse Conditions,” in Speech Processing in the Auditory System. Springer New York, 2004, pp. 231–308

work page 2004
[10]

Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,

T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and Biing-Hwang Juang, “Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010

work page 2010
[11]

Robust Speaker Identification in Noisy and Reverberant Conditions,

X. Zhao, Y. Wang, and D. Wang, “Robust Speaker Identification in Noisy and Reverberant Conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014

work page 2014
[12]

Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification,

M. Kolbæk, Z.-H. Tan, and J. Jensen, “Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification,” in IEEE Spoken Language Technology Workshop, 2016, pp. 305–311

work page 2016
[13]

DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,

Z. Oo, Y. Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, “DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 2204–2208

work page 2016
[14]

Front-end speech enhancement for commercial speaker verification systems,

S. E. Eskimez, P. Soufleris, Z. Duan, and W. Heinzelman, “Front-end speech enhancement for commercial speaker verification systems,” Speech Communication, vol. 99, pp. 101–113, 2018

work page 2018
[15]

Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,

J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 196–200

work page 2016
[16]

Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,

E. Warsitz and R. Haeb-Umbach, “Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007

work page 2007
[17]

Modulation Spectral Features for Robust Far-Field Speaker Identification,

T. Falk and Wai-Yip Chan, “Modulation Spectral Features for Robust Far-Field Speaker Identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2010

work page 2010
[18]

Hilbert Envelope Based Features for Robust Speaker Identification Under Reverberant Mismatched Conditions,

S. O. Sadjadi and J. H. Hansen, “Hilbert Envelope Based Features for Robust Speaker Identification Under Reverberant Mismatched Conditions,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 5448–5451

work page 2011
[19]

Speaker Identification with Distant Microphone Speech,

Q. Jin, R. Li, Q. Yang, K. Laskowski, and T. Schultz, “Speaker Identification with Distant Microphone Speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4518–4521

work page 2010
[20]

Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch,

S. O. Sadjadi and J. H. L. Hansen, “Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 5, pp. 937–945, 2014

work page 2014
[21]

Reverberation Matching for Speaker Recognition,

I. Peer, B. Rafaely, and Y. Zigel, “Reverberation Matching for Speaker Recognition,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4829–4832

work page 2008
[22]

Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-Vector Systems,

A. R. Avila, M. Sarria-Paja, F. J. Fraga, D. O’Shaughnessy, and T. H. Falk, “Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-Vector Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2014, pp. 1096–1100

work page 2014
[23]

Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition,

D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, “Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4257–4260

work page 2012
[24]

Robust Speaker Recognition from Distant Speech under Real Reverberant Environments Using Speaker Embeddings,

M. K. Nandwana, J. van Hout, M. McLaren, A. Stauffer, C. Richey, A. Lawson, and M. Graciarena, “Robust Speaker Recognition from Distant Speech under Real Reverberant Environments Using Speaker Embeddings,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1106–1110

work page 2018
[25]

Far-Field Speaker Recognition,

Q. Jin, T. Schultz, and A. Waibel, “Far-Field Speaker Recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2023–2032, 2007

work page 2007
[26]

Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,

Mikyong Ji, Sungtak Kim, Hoirin Kim, and Ho-Sub Yoon, “Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,” IEEE Transactions on Consumer Electronics, vol. 54, no. 1, pp. 140–144, 2008

work page 2008
[27]

The VOiCES from a Distance Challenge 2019 Evaluation Plan

M. K. Nandwana, J. V. Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, “The VOiCES from a Distance Challenge 2019 Evaluation Plan,” arXiv:1902.10828 [eess.AS], 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[28]

Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,

R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 351–355

work page 2018
[29]

Voices Obscured in Complex Environmental Settings (VOICES) corpus,

C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices Obscured in Complex Environmental Settings (VOICES) corpus,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, p...

work page 2018
[30]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,

C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016

work page 2016
[32]

Complex Sounds and Auditory Images,

R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex Sounds and Auditory Images,” in Auditory Physiology and Perception. Oxford, UK: Y. Cazals, L. Demany, and K. Horner, (Eds), Pergamon Press, 1992, pp. 429–446

work page 1992
[33]

Insights into End-to-End Learning Scheme for Language Identification,

W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213

work page 2018
[34]

Analysis of length normalization in end-to-end speaker verification system,

W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622

work page 2018
[35]

Sphereface: Deep Hypersphere Embedding for Face Recognition,

W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220

work page 2017
[36]

Return of Frustratingly Easy Domain Adaptation,

B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065

work page 2016
[37]

Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,

M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey: The Speaker and Language Recognition Workshop, 2018

work page 2018
[38]

Analysis of i-vector Length Normalization in Speaker Recognition Systems,

D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252

work page 2011
[39]

Analysis of Score Normalization in Multilingual Speaker Recognition,

P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017

work page 2017
[40]

The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF

N. Brümmer and E. De Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[41]

VoxCeleb: A Large-Scale Speaker Identification Dataset,

A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2616–2620

work page 2017
[42]

VoxCeleb2: Deep Speaker Recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018

work page 2018

[1] [1]

VOiCES from a Distance Challenge 2019

Introduction In the past decade, the performance of speaker recognition has improved signiﬁcantly. The i-vector based method [1] and the deep neural network (DNN) based methods [2, 3] have promoted the development of speaker recognition technology in telephone channel and closed talking scenarios. However, speaker recognition under far-ﬁeld and complex en...

work page 2019

[2] [2]

The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge

System descriptions 2.1. Data pre-processing 2.1.1. Data augmentation We adopt two kinds of data augmentation strategies. The ﬁrst is the same as the x-vector system available at Kaldi V oxceleb recipe, which employs additive noises and reverberation. We also use pyroomacoustics [24] to simulate the room acoustic based on RIR generator using Image Source ...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

Data usage The training data includes V oxCeleb 1 [37] and V oxCeleb 2 [38]

Experiments 3.1. Data usage The training data includes V oxCeleb 1 [37] and V oxCeleb 2 [38]. The original distribution of V oxCeleb split each video into multiple short segments. During training, the segments from the same video are concatenated into a single sound wave, which results in 167897 utterances from 7245 speakers. No voice activity detection (...

work page

[4] [4]

We use different acoustic fea- tures, different front-end modeling methods, and various back- end scoring methods

Conclusions We presented the components and analyzed the results of the DKU-SMIIP speaker recognition system for the VOiCES from a Distance Challenge 2019. We use different acoustic fea- tures, different front-end modeling methods, and various back- end scoring methods. To further improve the performance, we use WPE to dereverberate the development and ev...

work page 2019

[5] [5]

Front- End Factor Analysis for Speaker Veriﬁcation,

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front- End Factor Analysis for Speaker Veriﬁcation,” IEEE Transactions on Au- dio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

work page 2011

[6] [6]

x- vectors: Robust DNN Embeddings for Speaker Recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “x- vectors: Robust DNN Embeddings for Speaker Recognition,” in IEEE In- ternational Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333

work page 2018

[7] [7]

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,

W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 74– 81

work page 2018

[8] [8]

Wolfel and J

M. Wolfel and J. McDonough, Distant Speech Recognition. John Wiley & Sons, Incorporated, 2009

work page 2009

[9] [9]

The Perception of Speech Under Ad- verse Conditions,

P. Assmann and Q. Summerﬁeld, “The Perception of Speech Under Ad- verse Conditions,” in Speech Processing in the Auditory System. Springer New York, 2004, pp. 231–308

work page 2004

[10] [10]

Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,

T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and Biing-Hwang Juang, “Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010

work page 2010

[11] [11]

Robust Speaker Identiﬁcation in Noisy and Reverberant Conditions,

X. Zhao, Y . Wang, and D. Wang, “Robust Speaker Identiﬁcation in Noisy and Reverberant Conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014

work page 2014

[12] [12]

Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Veriﬁcation,

M. Kolboek, Z.-H. Tan, and J. Jensen, “Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Veriﬁcation,” in IEEE Spoken Language Technology Workshop , 2016, pp. 305–311

work page 2016

[13] [13]

DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identiﬁcation,

Z. Oo, Y . Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, “DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identiﬁcation,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 2204–2208

work page 2016

[14] [14]

Front-end speech enhancement for commercial speaker veriﬁcation systems,

S. E. Eskimez, P. Souﬂeris, Z. Duan, and W. Heinzelman, “Front-end speech enhancement for commercial speaker veriﬁcation systems,” Speech Communication, vol. 99, pp. 101–113, 2018

work page 2018

[15] [15]

Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,

J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural Network Based Spectral Mask Estimation for Acoustic Beamforming,” in2016 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 196–200

work page 2016

[16] [16]

Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,

E. Warsitz and R. Haeb-Umbach, “Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007

work page 2007

[17] [17]

Modulation Spectral Features for Robust Far- Field Speaker Identiﬁcation,

T. Falk and Wai-Yip Chan, “Modulation Spectral Features for Robust Far- Field Speaker Identiﬁcation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2010

work page 2010

[18] [18]

Hilbert Envelope Based Features for Ro- bust Speaker Identiﬁcation Under Reverberant Mismatched Conditions,

S. O. Sadjadi and J. H. Hansen, “Hilbert Envelope Based Features for Ro- bust Speaker Identiﬁcation Under Reverberant Mismatched Conditions,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing, 2011, pp. 5448–5451

work page 2011

[19] [19]

Speaker Identiﬁca- tion with Distant Microphone Speech,

Q. Jin, R. Li, Q. Yang, K. Laskowski, and T. Schultz, “Speaker Identiﬁca- tion with Distant Microphone Speech,” in2010 IEEE International Confer- ence on Acoustics, Speech and Signal Processing, 2010, pp. 4518–4521

work page 2010

[20] S. O. Sadjadi and J. H. L. Hansen, “Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 5, pp. 937–945, 2014.

[21] I. Peer, B. Rafaely, and Y. Zigel, “Reverberation Matching for Speaker Recognition,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4829–4832.

[22] A. R. Avila, M. Sarria-Paja, F. J. Fraga, D. O’Shaughnessy, and T. H. Falk, “Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-Vector Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2014, pp. 1096–1100.

[23] D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, “Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4257–4260.

[24] M. K. Nandwana, J. van Hout, M. McLaren, A. Stauffer, C. Richey, A. Lawson, and M. Graciarena, “Robust Speaker Recognition from Distant Speech under Real Reverberant Environments Using Speaker Embeddings,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1106–1110.

[25] Q. Jin, T. Schultz, and A. Waibel, “Far-Field Speaker Recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2023–2032, 2007.

[26] Mikyong Ji, Sungtak Kim, Hoirin Kim, and Ho-Sub Yoon, “Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,” IEEE Transactions on Consumer Electronics, vol. 54, no. 1, pp. 140–144, 2008.

[27] M. K. Nandwana, J. V. Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, “The VOiCES from a Distance Challenge 2019 Evaluation Plan,” arXiv:1902.10828 [eess.AS], 2019.

[28] R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 351–355.

[29] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout, P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices Obscured in Complex Environmental Settings (VOICES) corpus,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018, p...

[30] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015.

[31] C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.

[32] R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex Sounds and Auditory Images,” in Auditory Physiology and Perception. Oxford, UK: Y. Cazals, L. Demany, and K. Horner (Eds.), Pergamon Press, 1992, pp. 429–446.

[33] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into End-to-End Learning Scheme for Language Identification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5209–5213.

[34] W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622.

[35] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep Hypersphere Embedding for Face Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.

[36] B. Sun, J. Feng, and K. Saenko, “Return of Frustratingly Easy Domain Adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065.

[37] M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker Verification in Mismatched Conditions with Frustratingly Easy Domain Adaptation,” in Odyssey: The Speaker and Language Recognition Workshop, 2018.

[38] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252.

[39] P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and J. Černocký, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017.

[40] N. Brümmer and E. De Villiers, “The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF,” arXiv preprint arXiv:1304.2865, 2013.

[41] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A Large-Scale Speaker Identification Dataset,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2616–2620.

[42] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep Speaker Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association, 2018.