
arxiv: 1907.06112 · v1 · pith:6TM3X3ILnew · submitted 2019-07-13 · 📡 eess.AS · cs.CL · cs.SD

BUT VOiCES 2019 System Description

Pith reviewed 2026-05-24 21:47 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD
keywords speaker recognition · x-vector · VOiCES challenge · equal error rate · PLDA adaptation · system fusion · i-vector

The pith

Fusion of three x-vector systems reaches 1.0% EER in the VOiCES 2019 speaker recognition challenge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports the results of several speaker recognition systems submitted to the VOiCES 2019 challenge. All fixed-condition entries rely on the x-vector approach but vary the acoustic features and the exact neural network architectures used to extract speaker embeddings. The strongest single system records 1.2% equal error rate; combining the scores of three such systems lowers the error to 1.0%, a 15% relative reduction. When external data are allowed in the open condition, adapting the PLDA backend produces an additional gain of less than 10% relative. The final open-condition submission also includes one i-vector system alongside the three x-vector extractors.

Core claim

Systems built on the x-vector paradigm with differing features and DNN topologies reach 1.2% EER for the best single entry and 1.0% EER after fusing three systems, a 15% relative improvement. In the open condition, external data used only for PLDA adaptation yield less than ~10% relative improvement. The open submission combines three x-vector systems with one i-vector system.
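The relative improvement quoted above can be checked with a short calculation. The sketch below uses only the rounded EER figures given in the text; the function name `relative_reduction` is illustrative, not from the paper. With the rounded inputs the result comes out near 16.7%, slightly above the stated 15%. That gap would be consistent with the published EERs being rounded to one decimal place, though the source does not say so.

```python
def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative EER reduction, expressed as a fraction of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return (baseline_eer - new_eer) / baseline_eer


single_best = 1.2  # % EER of the best single system (rounded, from the text)
fused = 1.0        # % EER of the three-system fusion (rounded, from the text)

rel = relative_reduction(single_best, fused)
print(f"{rel:.1%}")  # prints 16.7%
```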

What carries the argument

The x-vector paradigm that extracts fixed-length speaker embeddings from a deep neural network trained for speaker classification, together with score-level fusion across multiple feature and topology variants.
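To make the idea of score-level fusion concrete, here is a minimal sketch that combines the scores of several systems with a weighted sum whose weights are learned by logistic regression on labeled trials. This is a common way to fuse speaker verification scores, but it is an assumption here: the text does not specify the authors' fusion recipe, and the synthetic scores, the function names `train_fusion` and `fuse`, and the hyperparameters are all hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(weights, bias, trial_scores):
    """Fused score: bias plus weighted sum of the per-system scores."""
    return bias + sum(w * s for w, s in zip(weights, trial_scores))


def train_fusion(trials, labels, lr=0.1, epochs=500):
    """Fit fusion weights by batch gradient descent on the logistic loss.

    trials: list of per-trial score lists (one score per system)
    labels: 1 for target trials, 0 for non-target trials
    """
    n_sys = len(trials[0])
    weights, bias = [0.0] * n_sys, 0.0
    n = len(trials)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for x, y in zip(trials, labels):
            err = sigmoid(fuse(weights, bias, x)) - y
            grad_b += err
            for i, s in enumerate(x):
                grad_w[i] += err * s
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic scores for three hypothetical systems (not real challenge data).
random.seed(0)
trials, labels = [], []
for _ in range(200):
    trials.append([random.gauss(2.0, 1.5) for _ in range(3)])
    labels.append(1)
    trials.append([random.gauss(-2.0, 1.5) for _ in range(3)])
    labels.append(0)

weights, bias = train_fusion(trials, labels)
fused = [fuse(weights, bias, x) for x in trials]
accuracy = sum((f > 0) == bool(y) for f, y in zip(fused, labels)) / len(labels)
print(weights, bias, accuracy)
```

Because each synthetic system carries independent noise, the fused score separates targets from non-targets better than any single score, which is the effect score fusion relies on.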

If this is right

  • Score fusion across different x-vector configurations reliably lowers error rates under fixed training conditions.
  • External data restricted to PLDA adaptation delivers only modest further gains once the embedding extractors are already strong.
  • Including an i-vector system in the open-condition fusion does not prevent the overall 1.0% EER target from being met.
  • System combination remains an effective route to performance improvement even when individual embeddings are already competitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modest open-condition gain suggests that the fixed training data already capture most of the speaker variability needed for this test set.
  • Future work could test whether the same fusion benefit holds when the underlying embeddings come from newer architectures such as ResNets or transformers.
  • The results imply that challenge organizers should continue to publish both fixed and open tracks so that the value of additional data can be quantified separately from embedding quality.

Load-bearing premise

All reported EER figures were produced by strictly obeying the fixed-condition rules and evaluation protocol of the VOiCES 2019 challenge without undisclosed data or post-hoc tuning.

What would settle it

An independent run of the submitted systems on the official VOiCES 2019 test set that returns EER values materially above 1.0% would falsify the performance numbers.
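For readers who want to run such a check, the equal error rate is the operating point at which the false-rejection rate on target trials equals the false-acceptance rate on non-target trials. The following is a minimal sketch that approximates it by sweeping a threshold over the observed scores; the function name `compute_eer` and the toy scores are illustrative assumptions, and the official challenge scoring tools may interpolate differently.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is
    taken where the gap between miss rate and false-alarm rate is smallest,
    reported as the average of the two rates at that threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy scores: one target scores below one non-target, so the EER is 1/4.
targets = [2.0, 1.5, 1.0, 0.2]
nontargets = [-1.0, -0.5, 0.0, 0.5]
print(compute_eer(targets, nontargets))  # prints 0.25
```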

Original abstract

This is a description of our effort in VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptation and achieved less than ~10% relative improvement. In the submission to open condition, we used 3 x-vector systems and also one i-vector based system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a system description for the BUT team's entry in the VOiCES 2019 Speaker Recognition challenge. It states that all fixed-condition systems follow the x-vector paradigm with variations in features and DNN topologies. The single best system achieves 1.2% EER, with a fusion of three systems reaching 1.0% EER (15% relative improvement). For the open condition, external data is used only for PLDA adaptation, yielding less than ~10% relative improvement, and the submission includes three x-vector systems plus one i-vector system.

Significance. Assuming the EER figures were obtained following the challenge protocols, this work provides a record of effective x-vector configurations and the benefits of system fusion on the VOiCES 2019 evaluation set. The quantified improvement from fusion highlights the value of combining multiple systems. However, the limited gain from external data in the open condition suggests that PLDA adaptation alone may not yield substantial benefits. As a system description, its primary significance is in sharing practical implementation details, though the current text offers few such details.

major comments (2)
  1. [Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.
  2. [Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.
minor comments (2)
  1. The manuscript appears to be extremely brief; expanding with at least one section detailing the systems would greatly improve its utility as a system description.
  2. [Abstract] The phrase 'less than ~10%' combines 'less than' with an approximate symbol, which is redundant and unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review of our VOiCES 2019 system description manuscript. We address the major comments point by point below, indicating where revisions will be made to the abstract.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.

    Authors: As a system description paper, the main text elaborates on the x-vector systems, including variations in features and DNN topologies. We agree the abstract is overly concise and will revise it to briefly note the key differences in acoustic features, network architectures, and training data employed across the systems.
    Revision: yes

  2. Referee: [Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.

    Authors: We agree that including concrete EER numbers would improve transparency. We will revise the abstract to report the specific EER values obtained in the open condition along with the relative improvement compared to the fixed-condition baseline.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

This is a standard challenge system description paper. The central claims are measured EER values (1.2% single best, 1.0% fused) obtained by following the fixed-condition VOiCES 2019 evaluation protocol. No derivations, predictions, fitted parameters renamed as outputs, or self-citation chains are present that could reduce to inputs by construction. Results are direct outputs of an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are present; the contribution is empirical performance reporting on a public challenge.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Introduction This submission is a description of our effort in VOiCES 2019 Speaker Recognition challenge [1]. Most of the systems are based on x-vectors [2] with an exception of the i-vector subsystem for open condition which uses concatenation of MFCCs and Stacked bottlenecks (SBN) features [3]. Our systems utilize different features (MFCC, PLP, Mel-...

  2. [2]

    Experimental Setup 2.1. Training data, Augmentations For x-vector training we used only VoxCeleb 1 and 2 dataset with 166 thousands audio files (distributed in 1.2 million speech segments) from 7146 speakers. We performed the following data augmentations based on the Kaldi recipe and created additional 5 million segments based on these augmentations: •...

  3. [3]

    HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from recordings (SBN were downsampled to 8kHz)

    i-vector Systems The system is based on gender independent i-vectors [11, 12]. HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from recordings (SBN were downsampled to 8kHz). Final feature vector is concatenation of both as they proved to perform very well in NIST SRE [3]. This system uses VAD-NN. Universal background mo...

  4. [4]

    The systems were trained in Kaldi toolkit [14] using SRE16 recipe with modifications described below: • Using different feature sets • Training networks with 9 epochs (instead of 3)

    x-vector Systems All x-vectors used VAD-Energy from Kaldi SRE16 recipe 6. The systems were trained in Kaldi toolkit [14] using SRE16 recipe with modifications described below: • Using different feature sets • Training networks with 9 epochs (instead of 3). We did not see any considerable difference with 12 epochs. • Using modified example generation - we u...

  5. [5]

    Heavy-tailed PLDA Our i-vector system used HT-PLDA backend [16]

    Backend 5.1. Heavy-tailed PLDA Our i-vector system used HT-PLDA backend [16]. It was trained on VoxCeleb 1 and 2 datasets. Training set consisted of 166 thousands audio files from 7146 speakers. Length normalization, centering, LDA, reducing dimensionality of vectors to 300, followed by another length normalization were applied to all i-vectors. All i...

  6. [6]

    Each system provided log-likelihood ratio scores that could be subjected to score normalization

    Calibration & Fusion The submission strategy was one common fusion trained on the labeled VOiCES development data [20, 1]. Each system provided log-likelihood ratio scores that could be subjected to score normalization. These scores were first pre-calibrated and then passed into the fusion. The output of the fusion was then again re-calibrated. Both c...

  7. [7]

    The VOiCES from a Distance Challenge 2019 Evaluation Plan

    Mahesh Kumar Nandwana, Julien van Hout, Mitch McLaren, Aaron Lawson, and María Auxiliadora Barrios, “The voices from a distance challenge 2019 evaluation plan,” in arXiv:1902.10828 [eess.AS], 2019

  8. [8]

    X-vectors: Robust dnn embeddings for speaker recognition,

    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” Submitted to ICASSP, 2018

  9. [9]

    Analysis of dnn approaches to speaker identification,

    Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of dnn approaches to speaker identification,” in Proceedings of the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016. 2016, IEEE Signal Processing Society

  10. [10]

    Building and Evaluation of a Real Room Impulse Response Dataset,

    Igor Szöke, Miroslav Skácel, Ladislav Mošner, Jakub Paliesek, and Jan Černocký, “Building and Evaluation of a Real Room Impulse Response Dataset,” Under review for IEEE Journal of Selected Topics in Signal Processing, 2019

  11. [11]

    Dereverberation and Beamforming in Robust Far-Field Speaker Recognition,

    Ladislav Mošner, Oldřich Plchot, Pavel Matějka, Ondřej Novotný, and Jan Černocký, “Dereverberation and Beamforming in Robust Far-Field Speaker Recognition,” in Proceedings of Interspeech

  12. [12]

    1334–1338, International Speech Communication Association

    2018, pp. 1334–1338, International Speech Communication Association

  13. [13]

    A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm,

    Kornel Laskowski and Jens Edlund, “A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010

  14. [14]

    A robust algorithm for pitch tracking (RAPT),

    David Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis, W. B. Kleijn and K. Paliwal, Eds., New York, 1995, Elsevier

  15. [15]

    BUT 2014 Babel system: Analysis of adaptation in NN based systems,

    Martin Karafiát, František Grézl, Karel Veselý, Mirko Hannemann, Igor Szőke, and Jan Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Interspeech 2014, 2014, pp. 3002–3006

  16. [16]

    Neural network bottleneck features for language identification,

    Pavel Matějka et al., “Neural network bottleneck features for language identification,” in IEEE Odyssey: The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014

  17. [17]

    A pitch extraction algorithm tuned for automatic speech recognition,

    P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014, pp. 2494–2498

  18. [18]

    Front-end factor analysis for speaker verification,

    N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. PP, no. 99, pp. 1–1, 2010

  19. [19]

    Bayesian speaker verification with heavy–tailed priors,

    P. Kenny, “Bayesian speaker verification with heavy–tailed priors,” keynote presentation, Proc. of Odyssey 2010, June 2010

  20. [20]

    Speech dereverberation based on variance-normalized delayed linear prediction,

    T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, Sep. 2010

  21. [21]

    The kaldi speech recognition toolkit,

    Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

  22. [22]

    Speaker recognition for multi-speaker conversations using x-vectors,

    David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP, 2019

  23. [23]

    Fast variational bayes for heavy-tailed plda applied to i-vectors and x-vectors,

    Anna Silnova, Niko Brummer, Daniel Garcia-Romero, David Snyder, and Lukáš Burget, “Fast variational bayes for heavy-tailed plda applied to i-vectors and x-vectors,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 2018

  24. [24]

    Analysis of score normalization in multilingual speaker recognition,

    Pavel Matějka, Ondřej Novotný, Oldřich Plchot, Lukáš Burget, Mireia Sánchez Diez, and Jan Černocký, “Analysis of score normalization in multilingual speaker recognition,” in Proceedings of Interspeech 2017. 2017, pp. 1567–1571, International Speech Communication Association

  25. [25]

    Speaker adaptive cohort selection for tnorm in text-independent speaker verification,

    D. E. Sturim and Douglas A. Reynolds, “Speaker adaptive cohort selection for tnorm in text-independent speaker verification,” in ICASSP, 2005, pp. 741–744

  26. [26]

    How to deal with multiple-targets in speaker identification systems?,

    Yaniv Zigel and Moshe Wasserblat, “How to deal with multiple-targets in speaker identification systems?,” in Proceedings of the Speaker and Language Recognition Workshop (IEEE-Odyssey 2006), San Juan, Puerto Rico, June 2006

  27. [27]

    Voices obscured in complex environmental settings (VOICES) corpus,

    Colleen Richey, María Auxiliadora Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen R. Stauffer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices obscured in complex environmental settings (VOICES) corpus,” in ISCA INTERSPEECH 2018, 2018