arxiv: 1907.02194 · v1 · pith:527U7R2Anew · submitted 2019-07-04 · 📡 eess.AS

The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge

Pith reviewed 2026-05-25 09:06 UTC · model grok-4.3

classification 📡 eess.AS
keywords speaker recognition · far-field verification · residual neural network · angular softmax loss · weighted prediction error · VOiCES challenge · minDCF · equal error rate

The pith

The DKU system reaches 0.3532 minDCF and 4.96% EER on the 2019 VOiCES far-field speaker verification evaluation set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper lays out a full pipeline for speaker recognition from distant audio: data pre-processing, short-term spectral features, utterance-level speaker modeling, back-end scoring, and score normalization. The best single system uses a residual neural network trained with angular softmax loss, helped further by weighted prediction error processing, and reaches 0.3668 minDCF and 5.58% EER with simple cosine similarity scoring. The submitted primary system reaches 0.3532 minDCF and 4.96% EER on the challenge evaluation data. A reader would care because far-field conditions introduce distortions that standard close-talk systems do not handle well, so concrete numbers on a shared benchmark show whether the pipeline overcomes them.

Core claim

The submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set. The best single system employs a residual neural network trained with angular softmax loss, and weighted prediction error algorithms further improve its performance; it reaches 0.3668 minDCF and 5.58% EER on the evaluation set with simple cosine similarity scoring.
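
For readers unfamiliar with cosine similarity scoring, the following is a minimal sketch of how a verification trial could be scored from two utterance-level embeddings. The function name `cosine_score` and the toy 4-dimensional vectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings.

    A higher score means the two utterances are more likely
    to come from the same speaker.
    """
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(np.dot(enroll, test))

# Hypothetical toy embeddings, for illustration only.
a = np.array([1.0, 2.0, 0.0, -1.0])
b = np.array([2.0, 4.0, 0.0, -2.0])   # same direction as a
c = np.array([2.0, -1.0, 3.0, 0.0])   # orthogonal to a

same_direction = cosine_score(a, b)   # close to 1
orthogonal = cosine_score(a, c)       # close to 0
```

Because the score depends only on the angle between embeddings, it pairs naturally with a training loss that separates speakers by angle.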

What carries the argument

Residual neural network trained with angular softmax loss for utterance-level speaker modeling, augmented by weighted prediction error signal processing and cosine scoring with normalization.

If this is right

  • The full pipeline of pre-processing, residual network embeddings, weighted prediction error, and normalization reduces both minDCF and EER relative to simpler baselines.
  • Angular softmax training produces embeddings that support effective cosine scoring in far-field conditions.
  • Adding weighted prediction error yields measurable gains beyond the neural network alone.
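
As a rough illustration of why angular softmax training suits cosine scoring, here is a sketch of the angular margin function from the A-Softmax (SphereFace) formulation and the resulting loss for a single embedding. The margin `m = 4`, the toy weights `W`, and the toy embedding `x` are assumptions for illustration; the paper's actual hyperparameters are not stated in this summary.

```python
import numpy as np

def angular_margin(theta, m: int = 4):
    """psi(theta) = (-1)^k * cos(m * theta) - 2k,
    for theta in [k*pi/m, (k+1)*pi/m], k in {0, ..., m-1}."""
    theta = np.asarray(theta, dtype=float)
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

def a_softmax_loss(embedding, weights, label: int, m: int = 4) -> float:
    """A-Softmax loss for one embedding.

    weights holds one row per training speaker; rows are length-normalized,
    so each logit is ||x|| * cos(theta_j), except the target class,
    which uses the stricter ||x|| * psi(theta_label).
    """
    embedding = np.asarray(embedding, dtype=float)
    weights = np.asarray(weights, dtype=float)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    norm = np.linalg.norm(embedding)
    cos = np.clip(w @ embedding / norm, -1.0, 1.0)
    logits = norm * cos
    logits[label] = norm * angular_margin(np.arccos(cos[label]), m)
    logits = logits - logits.max()          # numerical stability
    return float(-logits[label] + np.log(np.exp(logits).sum()))

# Hypothetical 2-D example with three "speakers".
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
x = np.array([2.0, 1.0])
loss_with_margin = a_softmax_loss(x, W, label=0, m=4)
loss_plain = a_softmax_loss(x, W, label=0, m=1)   # m = 1 reduces psi to cos
```

With `m > 1` the target-class logit is penalized unless the embedding lies well inside its class's angular region, which is one plausible reason the resulting embeddings can be compared with a plain cosine score.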

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular pipeline could be transferred to other distant-audio tasks such as meeting transcription or voice commands in large rooms.
  • Replacing the residual network with newer architectures might produce further reductions in EER if retrained under the same loss.
  • Score normalization appears to correct for score distribution shifts caused by varying distances and room acoustics.
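
To make the score-normalization point concrete, here is a sketch of adaptive symmetric normalization, one common variant in speaker verification. Whether it matches the paper's exact recipe is not confirmed by this summary; the function name `as_norm`, the cohort scores, and `top_k = 2` are illustrative assumptions.

```python
import numpy as np

def as_norm(raw: float, enroll_cohort, test_cohort, top_k: int = 2) -> float:
    """Adaptive symmetric score normalization for one trial.

    raw:            raw trial score (e.g. cosine similarity)
    enroll_cohort:  scores of the enrollment embedding against impostor cohort
    test_cohort:    scores of the test embedding against the same cohort
    Only the top_k highest cohort scores on each side are used.
    """
    def z(scores) -> float:
        top = np.sort(np.asarray(scores, dtype=float))[-top_k:]
        # Small floor avoids division by zero if the top scores are identical.
        return (raw - top.mean()) / max(top.std(), 1e-12)
    return 0.5 * (z(enroll_cohort) + z(test_cohort))

# Hypothetical cohort scores, for illustration only.
normalized = as_norm(
    raw=0.8,
    enroll_cohort=[0.1, 0.2, 0.3, 0.5],
    test_cohort=[0.0, 0.2, 0.4, 0.6],
)
```

Each side of the trial is standardized against its own closest impostors, so a raw score recorded in a reverberant room is judged relative to how that recording scores against everyone else rather than on an absolute scale.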

Load-bearing premise

The challenge evaluation set provides a representative test of far-field speaker verification performance without hidden domain shifts or selection effects that would make the reported numbers unrepresentative.

What would settle it

An independent team re-running the exact pipeline on the same evaluation set and obtaining error rates materially different from 0.3532 minDCF and 4.96% EER would show the reported figures do not hold.
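
A re-run of this kind would need to recompute both metrics from trial scores. The sketch below shows one simple way to do that. The default operating point (`p_target = 0.01`, unit costs) is an assumed common setting that should be checked against the challenge evaluation plan, and the helper does not treat tied scores specially.

```python
import numpy as np

def error_rates(scores, labels):
    """Miss and false-alarm rates over all thresholds.

    labels: 1 for target trials, 0 for non-target trials.
    Entry i corresponds to rejecting the i lowest-scoring trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    labels = labels[np.argsort(scores)]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    return p_miss, p_fa

def eer(scores, labels) -> float:
    """Equal error rate: the point where miss and false-alarm rates meet."""
    p_miss, p_fa = error_rates(scores, labels)
    i = np.argmin(np.abs(p_miss - p_fa))
    return float((p_miss[i] + p_fa[i]) / 2)

def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0) -> float:
    """Minimum normalized detection cost over all thresholds."""
    p_miss, p_fa = error_rates(scores, labels)
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    return float(dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target)))

# Hypothetical trial list, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = eer(scores, labels)
toy_min_dcf = min_dcf(scores, labels)
```

Because minDCF weights false alarms far more heavily than misses at a low target prior, two systems with similar EER can still differ noticeably in minDCF.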

Figures

Figures reproduced from arXiv: 1907.02194 by Danwei Cai, Ming Li, Weicheng Cai, Xiaoyi Qin.

Figure 1: DET plots for development and evaluation dataset with original or dereverberated sound wave.
Original abstract

In this paper, we present the DKU system for the speaker recognition task of the VOiCES from a distance challenge 2019. We investigate the whole system pipeline for the far-field speaker verification, including data pre-processing, short-term spectral feature representation, utterance-level speaker modeling, back-end scoring, and score normalization. Our best single system employs a residual neural network trained with angular softmax loss. Also, the weighted prediction error algorithms can further improve performance. It achieves 0.3668 minDCF and 5.58% EER on the evaluation set by using a simple cosine similarity scoring. Finally, the submitted primary system obtains 0.3532 minDCF and 4.96% EER on the evaluation set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents the DKU system submitted to the speaker recognition task of the 2019 VOiCES from a Distance Challenge. It describes the full pipeline for far-field speaker verification, covering data pre-processing, short-term spectral features, utterance-level modeling via a residual neural network trained with angular softmax loss, back-end scoring with cosine similarity, and score normalization. The best single system with WPE preprocessing achieves 0.3668 minDCF and 5.58% EER; the submitted primary system reaches 0.3532 minDCF and 4.96% EER on the evaluation set.

Significance. If the reported numbers are reproducible and the evaluation set is unbiased, the work supplies a concrete empirical reference point on the VOiCES benchmark, demonstrating that ResNet + angular softmax combined with WPE can deliver competitive far-field performance. Challenge system papers of this type are useful for establishing baselines, but the lack of supporting analysis reduces the strength of any broader claims about generalization.

major comments (1)
  1. [Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.

Simulated Author's Rebuttal

1 response · 2 unresolved

We thank the referee for the detailed review of our manuscript on the DKU system for the 2019 VOiCES challenge. Below we respond point-by-point to the major comment.

  1. Referee: [Abstract] The central claim that the primary system obtains 0.3532 minDCF and 4.96% EER is presented without error bars, ablation results, or any description of evaluation-set selection criteria, acoustic-condition statistics, or validation steps confirming absence of domain shift relative to training data. This directly affects the ability to interpret the numerical result as evidence of effective far-field verification.

    Authors: The abstract is intentionally concise, as is conventional for challenge system descriptions. The full pipeline (WPE preprocessing, 64-dim log-Mel features, ResNet34 with angular softmax, cosine scoring, and AS-Norm) is detailed in Sections 2-4 of the manuscript, along with the training data (VoxCeleb + VOiCES dev) and the fact that all results are on the official VOiCES 2019 evaluation set. We agree that a short clause noting the official challenge evaluation protocol would aid interpretation and will add one sentence to the abstract. Error bars are not reported because the challenge provides a single fixed evaluation set with no provision for multiple independent runs. Ablation studies are outside the scope of this system paper, which focuses on the submitted primary system rather than comparative analysis.

    Revision: partial

standing simulated objections not resolved
  • Error bars on the reported minDCF and EER, because the challenge evaluation consists of a single run on a fixed test set.
  • Evaluation-set selection criteria and complete acoustic-condition statistics, which are determined by the VOiCES organizers and not fully disclosed to participants.

Circularity Check

0 steps flagged

No derivation chain; purely empirical system report

The paper describes a speaker verification pipeline (data preprocessing, ResNet with angular softmax, WPE, cosine scoring) and reports challenge metrics (0.3532 minDCF, 4.96% EER) on the public VOiCES evaluation set. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present that could reduce any claim to its own inputs by construction. The central claims are experimental results on an external benchmark, which are self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied competition-system paper with no mathematical derivations, free parameters fitted inside a model, or newly postulated entities; all components are drawn from prior literature.

pith-pipeline@v0.9.0 · 5665 in / 1075 out tokens · 46855 ms · 2026-05-25T09:06:22.024177+00:00 · methodology
