BUT VOiCES 2019 System Description
Pith reviewed 2026-05-24 21:47 UTC · model grok-4.3
The pith
Fusion of three x-vector systems reaches 1.0% EER in the VOiCES 2019 speaker recognition challenge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systems built on the x-vector paradigm with differing features and DNN topologies reach 1.2% EER for the best single entry and 1.0% EER after fusing three systems, a 15% relative improvement. In the open condition, external data used only for PLDA adaptation yield less than ~10% relative improvement. The open submission combines three x-vector systems with one i-vector system.
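The 15% figure can be checked against the rounded EERs quoted above. The sketch below is illustrative only: the function name is hypothetical, and the interval bounds assume the published EERs were rounded to one decimal place, which the source does not state.

```python
def relative_improvement(baseline_eer: float, new_eer: float) -> float:
    """Relative EER reduction as a fraction of the baseline EER."""
    return (baseline_eer - new_eer) / baseline_eer

# Rounded figures from the abstract: 1.2% single best, 1.0% fused.
from_rounded = relative_improvement(1.2, 1.0)

# Assumption: each EER was rounded to one decimal place, so the true values
# lie in [1.15, 1.25) and [0.95, 1.05). These bound the possible improvement.
smallest = relative_improvement(1.15, 1.05)
largest = relative_improvement(1.25, 0.95)

print(f"from rounded values: {from_rounded:.1%}")
print(f"range consistent with rounding: {smallest:.1%} to {largest:.1%}")
```

The rounded values give roughly 16.7%, so the reported 15% presumably comes from unrounded EERs; it falls inside the range that one-decimal rounding allows.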
What carries the argument
The x-vector paradigm that extracts fixed-length speaker embeddings from a deep neural network trained for speaker classification, together with score-level fusion across multiple feature and topology variants.
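How a network turns a variable-length utterance into a fixed-length embedding can be sketched with the statistics pooling step used in the standard x-vector recipe. This is a minimal illustration, not the BUT systems' code: the function name and the 512-dimensional frame activations are assumptions chosen for the example, and the frame-level and segment-level layers around the pooling step are omitted.

```python
import numpy as np

def stats_pooling(frames: np.ndarray) -> np.ndarray:
    """Pool (num_frames, dim) frame-level activations into one (2 * dim,) vector.

    The mean and standard deviation are taken over the time axis, so the
    output size does not depend on the utterance length.
    """
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
# Two hypothetical utterances of different lengths (200 and 900 frames).
short_utt = stats_pooling(rng.standard_normal((200, 512)))
long_utt = stats_pooling(rng.standard_normal((900, 512)))

print(short_utt.shape, long_utt.shape)
```

Both utterances map to vectors of the same size, which is what allows a fixed backend such as PLDA to compare them.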
If this is right
- Score fusion across different x-vector configurations reliably lowers error rates under fixed training conditions.
- External data restricted to PLDA adaptation delivers only modest further gains once the embedding extractors are already strong.
- The open-condition fusion adds one i-vector system to the three x-vector systems; the abstract gives no separate EER for this combination, so its contribution beyond the 1.0% fixed-condition result is not quantified.
- System combination remains an effective route to performance improvement even when individual embeddings are already competitive.
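Score-level fusion of the kind these points rely on can be sketched as a weighted sum of per-system scores, with the weights fitted by logistic regression. This is a toy illustration under loud assumptions: the data are synthetic, the function names are hypothetical, plain gradient descent stands in for whatever fusion toolkit the authors used, and the error is measured on the training trials themselves.

```python
import numpy as np

def train_linear_fusion(scores, labels, lr=0.1, steps=2000):
    """Fit fusion weights and an offset by logistic regression.

    scores: (num_trials, num_systems) per-system scores.
    labels: 1.0 for target trials, 0.0 for non-target trials.
    """
    n_trials, n_systems = scores.shape
    weights = np.zeros(n_systems)
    offset = 0.0
    for _ in range(steps):
        logits = scores @ weights + offset
        probs = 1.0 / (1.0 + np.exp(-logits))
        error = probs - labels  # gradient of the cross-entropy w.r.t. logits
        weights -= lr * (scores.T @ error) / n_trials
        offset -= lr * error.mean()
    return weights, offset

def error_rate(fused, labels):
    """Fraction of trials misclassified when thresholding scores at zero."""
    return float(np.mean((fused > 0) != (labels == 1)))

# Synthetic trials: three systems observe the same truth with independent noise.
rng = np.random.default_rng(0)
labels = np.repeat([1.0, 0.0], 2000)
clean = np.where(labels == 1, 1.0, -1.0)
scores = clean[:, None] + 1.5 * rng.standard_normal((4000, 3))

weights, offset = train_linear_fusion(scores, labels)
fused = scores @ weights + offset
single_errors = [error_rate(scores[:, i], labels) for i in range(3)]
fused_error = error_rate(fused, labels)
print(single_errors, fused_error)
```

The gain in this toy setup comes from the systems making independent errors; real systems that share training data and architecture are correlated, so the benefit is typically smaller.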
Where Pith is reading between the lines
- The modest open-condition gain suggests that the fixed training data already capture most of the speaker variability needed for this test set.
- Future work could test whether the same fusion benefit holds when the underlying embeddings come from newer architectures such as ResNets or transformers.
- The results imply that challenge organizers should continue to publish both fixed and open tracks so that the value of additional data can be quantified separately from embedding quality.
Load-bearing premise
All reported EER figures were produced by strictly obeying the fixed-condition rules and evaluation protocol of the VOiCES 2019 challenge without undisclosed data or post-hoc tuning.
What would settle it
An independent run of the submitted systems on the official VOiCES 2019 test set that returns EER values materially above 1.0% would falsify the performance numbers.
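For such a check, the EER would have to be recomputed from the trial scores. The following is a minimal sketch of one common way to do that, assuming separate arrays of target and non-target scores; the function name is hypothetical, ties between scores are not treated specially, and the official challenge scoring tool may differ in detail.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Approximate equal error rate from target and non-target trial scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]

    # Place the threshold just above each sorted score in turn:
    # targets at or below it are misses, non-targets above it are false alarms.
    miss = np.cumsum(labels) / len(target_scores)
    false_alarm = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)

    # The EER is where the two error rates cross; take the closest point.
    idx = np.argmin(np.abs(miss - false_alarm))
    return float((miss[idx] + false_alarm[idx]) / 2.0)

# Synthetic example: overlapping Gaussian score distributions.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(2.0, 1.0, 5000), rng.normal(-2.0, 1.0, 5000))
print(f"EER: {eer:.2%}")
```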
Original abstract
This is a description of our effort in VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptation and achieved less than ~10% relative improvement. In the submission to open condition, we used 3 x-vector systems and also one i-vector based system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a system description for the BUT team's entry in the VOiCES 2019 Speaker Recognition challenge. It states that all fixed-condition systems follow the x-vector paradigm with variations in features and DNN topologies. The single best system achieves 1.2% EER, with a fusion of three systems reaching 1.0% EER (15% relative improvement). For the open condition, external data is used only for PLDA adaptation, yielding less than ~10% relative improvement, and the submission includes three x-vector systems plus one i-vector system.
Significance. Assuming the EER figures were obtained following the challenge protocols, this work provides a record of effective x-vector configurations and the benefits of system fusion on the VOiCES 2019 evaluation set. The quantified improvement from fusion highlights the value of combining multiple systems. However, the limited gain from external data in the open condition suggests that PLDA adaptation alone may not yield substantial benefits. As a system description, its primary significance is in sharing practical implementation details, though the current text offers few such details.
major comments (2)
- [Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.
- [Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.
minor comments (2)
- The manuscript appears to be extremely brief; expanding with at least one section detailing the systems would greatly improve its utility as a system description.
- [Abstract] The phrase 'less than ~10%' combines 'less than' with an approximate symbol, which is redundant and unclear.
Simulated Author's Rebuttal
We thank the referee for the detailed review of our VOiCES 2019 system description manuscript. We address the major comments point by point below, indicating where revisions will be made to the abstract.
-
Referee: [Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.
Authors: Because this is a system description paper, the main text elaborates on the x-vector systems, including variations in features and DNN topologies. We agree the abstract is overly concise and will revise it to briefly note the key differences in acoustic features, network architectures, and training data employed across the systems. revision: yes
-
Referee: [Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.
Authors: We agree that including concrete EER numbers would improve transparency. We will revise the abstract to report the specific EER values obtained in the open condition along with the relative improvement compared to the fixed-condition baseline. revision: yes
Circularity Check
No significant circularity detected
This is a standard challenge system description paper. The central claims are measured EER values (1.2% single best, 1.0% fused) obtained by following the fixed-condition VOiCES 2019 evaluation protocol. No derivations, predictions, fitted parameters renamed as outputs, or self-citation chains are present that could reduce to inputs by construction. Results are direct outputs of an external benchmark.
Reference graph
Works this paper leans on
-
[1]
Introduction. This submission is a description of our effort in VOiCES 2019 Speaker Recognition challenge [1]. Most of the systems are based on x-vectors [2] with an exception of the i-vector subsystem for open condition which uses concatenation of MFCCs and Stacked bottlenecks (SBN) features [3]. Our systems utilize different features (MFCC, PLP, Mel-...
-
[2]
Experimental Setup. 2.1. Training data, Augmentations. For x-vector training we used only VoxCeleb 1 and 2 dataset with 166 thousand audio files (distributed in 1.2 million speech segments) from 7146 speakers. We performed the following data augmentations based on the Kaldi recipe and created additional 5 million segments based on these augmentations: •...
-
[3]
i-vector Systems. The system is based on gender independent i-vectors [11, 12]. HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from recordings (SBN were downsampled to 8kHz). Final feature vector is concatenation of both as they proved to perform very well in NIST SRE [3]. This system uses VAD-NN. Universal background mo...
-
[4]
x-vector Systems. All x-vectors used VAD-Energy from Kaldi SRE16 recipe 6. The systems were trained in Kaldi toolkit [14] using SRE16 recipe with modifications described below: • Using different feature sets • Training networks with 9 epochs (instead of 3). We did not see any considerable difference with 12 epochs. • Using modified example generation - we u...
-
[5]
Backend. 5.1. Heavy-tailed PLDA. Our i-vector system used HT-PLDA backend [16]. It was trained on VoxCeleb 1 and 2 datasets. Training set consisted of 166 thousand audio files from 7146 speakers. Length normalization, centering, LDA, reducing dimensionality of vectors to 300, followed by another length normalization were applied to all i-vectors. All i...
-
[6]
Calibration & Fusion. The submission strategy was one common fusion trained on the labeled VOiCES development data [20, 1]. Each system provided log-likelihood ratio scores that could be subjected to score normalization. These scores were first pre-calibrated and then passed into the fusion. The output of the fusion was then again re-calibrated. Both c...
-
[7]
Mahesh Kumar Nandwana, Julien van Hout, Mitch McLaren, Aaron Lawson, and María Auxiliadora Barrios, “The VOiCES from a Distance Challenge 2019 Evaluation Plan,” arXiv:1902.10828 [eess.AS], 2019
-
[8]
David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” Submitted to ICASSP, 2018
-
[9]
Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of DNN approaches to speaker identification,” in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), 2016, IEEE Signal Processing Society
-
[10]
Igor Szöke, Miroslav Skácel, Ladislav Mošner, Jakub Paliesek, and Jan Černocký, “Building and Evaluation of a Real Room Impulse Response Dataset,” Under review for IEEE Journal of Selected Topics in Signal Processing, 2019
-
[11]
Ladislav Mošner, Oldřich Plchot, Pavel Matějka, Ondřej Novotný, and Jan Černocký, “Dereverberation and Beamforming in Robust Far-Field Speaker Recognition,” in Proceedings of Interspeech 2018, pp. 1334–1338, International Speech Communication Association
-
[13]
Kornel Laskowski and Jens Edlund, “A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010
-
[14]
David Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis, W. B. Kleijn and K. Paliwal, Eds., New York, 1995, Elsevier
-
[15]
Martin Karafiát, František Grézl, Karel Veselý, Mirko Hannemann, Igor Szőke, and Jan Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Interspeech 2014, 2014, pp. 3002–3006
-
[16]
Pavel Matějka et al., “Neural network bottleneck features for language identification,” in IEEE Odyssey: The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014
-
[17]
P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014, pp. 2494–2498
-
[18]
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. PP, no. 99, pp. 1–1, 2010
-
[19]
P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” keynote presentation, Proc. of Odyssey 2010, June 2010
-
[20]
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, Sep. 2010
-
[21]
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011
-
[22]
David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP, 2019
-
[23]
Anna Silnova, Niko Brummer, Daniel Garcia-Romero, David Snyder, and Lukáš Burget, “Fast variational Bayes for heavy-tailed PLDA applied to i-vectors and x-vectors,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 2018
-
[24]
Pavel Matějka, Ondřej Novotný, Oldřich Plchot, Lukáš Burget, Mireia Sánchez Diez, and Jan Černocký, “Analysis of score normalization in multilingual speaker recognition,” in Proceedings of Interspeech 2017, 2017, pp. 1567–1571, International Speech Communication Association
-
[25]
D. E. Sturim and Douglas A. Reynolds, “Speaker adaptive cohort selection for Tnorm in text-independent speaker verification,” in ICASSP, 2005, pp. 741–744
-
[26]
Yaniv Zigel and Moshe Wasserblat, “How to deal with multiple-targets in speaker identification systems?,” in Proceedings of the Speaker and Language Recognition Workshop (IEEE-Odyssey 2006), San Juan, Puerto Rico, June 2006
-
[27]
Colleen Richey, María Auxiliadora Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen R. Stauffer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices obscured in complex environmental settings (VOICES) corpus,” in ISCA INTERSPEECH 2018, 2018