BUT VOiCES 2019 System Description

Anna Silnova; Hossein Zeinali; Ján Profant; Ladislav Mošner; Lukáš Burget; Oldřich Plchot; Ondřej Glembek; Ondřej Novotný; Pavel Matějka

arXiv: 1907.06112 · v1 · pith:6TM3X3ILnew · submitted 2019-07-13 · eess.AS · cs.CL · cs.SD


Pith reviewed 2026-05-24 21:47 UTC · model grok-4.3

classification: eess.AS, cs.CL, cs.SD

keywords: speaker recognition, x-vector, VOiCES challenge, equal error rate, PLDA adaptation, system fusion, i-vector


The pith

Fusion of three x-vector systems reaches 1.0% EER in the VOiCES 2019 speaker recognition challenge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports the results of several speaker recognition systems submitted to the VOiCES 2019 challenge. All fixed-condition entries rely on the x-vector approach but vary the acoustic features and the exact neural network architectures used to extract speaker embeddings. The strongest single system records 1.2% equal error rate; combining the scores of three such systems lowers the error to 1.0%, a 15% relative reduction. When external data are allowed in the open condition, adapting the PLDA backend produces an additional gain of less than 10% relative. The final open-condition submission also includes one i-vector system alongside the three x-vector extractors.

Core claim

Systems built on the x-vector paradigm with differing features and DNN topologies reach 1.2% EER for the best single entry and 1.0% EER after fusing three systems, a 15% relative improvement. In the open condition, external data used only for PLDA adaptation yield less than 10% relative improvement. The open submission combines three x-vector systems with one i-vector system.
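The relative-improvement figure can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper, `relative_improvement`, and only the rounded EERs quoted on this page. With those rounded inputs the reduction comes out near 16.7% rather than 15%, so the quoted figure presumably rests on unrounded EERs, which are not given here.

```python
def relative_improvement(baseline_eer: float, new_eer: float) -> float:
    """Relative EER reduction, expressed as a fraction of the baseline EER."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# Rounded figures from the summary: best single system vs. three-system fusion.
rounded = relative_improvement(1.2, 1.0)
print(f"{rounded:.1%}")  # prints 16.7%
```

The same helper applies to the open-condition claim: any open-condition EER within 10% relative of its baseline would return a value below 0.10.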

What carries the argument

The x-vector paradigm that extracts fixed-length speaker embeddings from a deep neural network trained for speaker classification, together with score-level fusion across multiple feature and topology variants.
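Score-level fusion can be illustrated with a small, self-contained sketch. It assumes a logistic-regression fusion, which is a common choice in speaker verification but is not confirmed here as the authors' method. The function names `train_fusion` and `fuse`, the learning rate, and the toy scores are all hypothetical; any calibration steps before or after fusion are left out.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_fusion(scores, labels, lr=0.1, epochs=500):
    """Fit one weight per subsystem plus a bias by batch gradient descent.

    scores: list of per-trial score vectors, one entry per subsystem.
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    n_sys = len(scores[0])
    n = len(scores)
    w = [0.0] * n_sys
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_sys
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for i in range(n_sys):
                grad_w[i] += err * s[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fuse(w, b, s):
    """Fused score for one trial: weighted sum of subsystem scores plus bias."""
    return sum(wi * si for wi, si in zip(w, s)) + b


# Toy development trials for three hypothetical subsystems (made-up numbers).
dev_scores = [[2.1, 1.8, 2.5], [1.5, 2.2, 1.9], [-1.9, -2.4, -1.2], [-2.2, -1.1, -2.0]]
dev_labels = [1, 1, 0, 0]
w, b = train_fusion(dev_scores, dev_labels)
print(fuse(w, b, [1.0, 1.2, 0.8]) > fuse(w, b, [-1.0, -0.7, -1.3]))
```

Learning the weights on labeled development trials lets a stronger subsystem receive a larger weight, which is the usual reason a fusion can beat its best single input.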

If this is right

Score fusion across different x-vector configurations reliably lowers error rates under fixed training conditions.
External data restricted to PLDA adaptation delivers only modest further gains once the embedding extractors are already strong.
Including an i-vector system in the open-condition fusion does not stop the open submission from improving on the fixed-condition result, although the reported gain stays below 10% relative.
System combination remains an effective route to performance improvement even when individual embeddings are already competitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modest open-condition gain suggests that the fixed training data already capture most of the speaker variability needed for this test set.
Future work could test whether the same fusion benefit holds when the underlying embeddings come from newer architectures such as ResNets or transformers.
The results imply that challenge organizers should continue to publish both fixed and open tracks so that the value of additional data can be quantified separately from embedding quality.

Load-bearing premise

All reported EER figures were produced by strictly obeying the fixed-condition rules and evaluation protocol of the VOiCES 2019 challenge without undisclosed data or post-hoc tuning.

What would settle it

An independent run of the submitted systems on the official VOiCES 2019 test set that returns EER values materially above 1.0% would falsify the performance numbers.
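For anyone repeating such a check, the EER itself is simple to compute from trial scores. The sketch below is a generic threshold sweep, not the challenge's official scoring tool; the function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The function
    returns the average of the miss rate and the false-alarm rate at the
    threshold where the two rates are closest. The sweep is quadratic in
    the number of trials, which is fine for small illustrative sets.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Made-up scores: one target trial scores below one non-target trial.
targets = [2.0, 1.5, 0.2, 3.1]
nontargets = [-1.0, 0.5, -2.2, -0.3]
print(equal_error_rate(targets, nontargets))  # prints 0.25
```

On a real evaluation set with many trials, a sorted cumulative sweep would be the more efficient way to obtain the same operating point.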

Original abstract

This is a description of our effort in VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptation and achieved less than ~10% relative improvement. In the submission to open condition, we used 3 x-vector systems and also one i-vector based system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard x-vector system description reporting 1.0% fused EER on VOiCES 2019 fixed condition; incremental engineering result with no new framework.


The core takeaway is that this is a competition system description, not a research advance. BUT's fixed-condition entries use the established x-vector approach with tweaks to features and DNN topologies. Their best single system hits 1.2% EER and a three-system fusion reaches 1.0% EER (15% relative gain). Open-condition runs add external data only for PLDA adaptation and mix in one i-vector system, for under 10% relative gain overall. The numbers are presented as direct outputs from the official protocol.

Referee Report

2 major / 2 minor

Summary. The manuscript is a system description for the BUT team's entry in the VOiCES 2019 Speaker Recognition challenge. It states that all fixed-condition systems follow the x-vector paradigm with variations in features and DNN topologies. The single best system achieves 1.2% EER, with a fusion of three systems reaching 1.0% EER (15% relative improvement). For the open condition, external data are used only for PLDA adaptation, yielding less than 10% relative improvement, and the submission includes three x-vector systems plus one i-vector system.

Significance. Assuming the EER figures were obtained following the challenge protocols, this work provides a record of effective x-vector configurations and the benefits of system fusion on the VOiCES 2019 evaluation set. The quantified improvement from fusion highlights the value of combining multiple systems. However, the limited gain from external data in the open condition suggests that PLDA adaptation alone may not yield substantial benefits. As a system description, its primary significance is in sharing practical implementation details, though the current text offers few such details.

major comments (2)

[Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.
[Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.

minor comments (2)

The manuscript appears to be extremely brief; expanding with at least one section detailing the systems would greatly improve its utility as a system description.
[Abstract] The phrase 'less than ~10%' combines 'less than' with an approximate symbol, which is redundant and unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review of our VOiCES 2019 system description manuscript. We address the major comments point by point below, indicating where revisions will be made to the abstract.


Referee: [Abstract] The central performance claims (1.2% EER single best, 1.0% fused) are presented without any accompanying description of the specific features, DNN topologies, training procedures, or data used. This omission makes it impossible to assess or reproduce the results, which are the core contribution of the paper.

Authors: As a system description paper, the main text elaborates on the x-vector systems, including variations in features and DNN topologies. We agree the abstract is overly concise and will revise it to briefly note the key differences in acoustic features, network architectures, and training data employed across the systems. Revision: yes.

Referee: [Abstract] The statement that open-condition systems 'achieved less than ~10% relative improvement' is vague and lacks the specific EER values or comparison to the fixed-condition baseline, weakening the ability to evaluate the impact of external data.

Authors: We agree that including concrete EER numbers would improve transparency. We will revise the abstract to report the specific EER values obtained in the open condition along with the relative improvement compared to the fixed-condition baseline. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity detected


This is a standard challenge system description paper. The central claims are measured EER values (1.2% single best, 1.0% fused) obtained by following the fixed-condition VOiCES 2019 evaluation protocol. No derivations, predictions, fitted parameters renamed as outputs, or self-citation chains are present that could reduce to inputs by construction. Results are direct outputs of an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are present; the contribution is empirical performance reporting on a public challenge.


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1]

Introduction This submission is a description of our effort in VOiCES 2019 Speaker Recognition challenge [1]. Most of the systems are based on x-vectors [2] with an exception of the i-vector subsystem for open condition which uses concatenation of MFCCs and Stacked bottlenecks (SBN) features [3]. Our systems utilize different features (MFCC, PLP, Mel-...

[2]

Experimental Setup 2.1. Training data, Augmentations For x-vector training we used only VoxCeleb 1 and 2 dataset with 166 thousand audio files (distributed in 1.2 million speech segments) from 7146 speakers. We performed the following data augmentations based on the Kaldi recipe and created additional 5 million segments based on these augmentations: •...

[3]

HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from recordings (SBN were downsampled to 8kHz)

i-vector Systems The system is based on gender independent i-vectors [11, 12]. HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from recordings (SBN were downsampled to 8kHz). Final feature vector is concatenation of both as they proved to perform very well in NIST SRE [3]. This system uses VAD-NN. Universal background mo...

[4]

The systems were trained in Kaldi toolkit [14] using SRE16 recipe with modifications described below: • Using different feature sets • Training networks with 9 epochs (instead of 3)

x-vector Systems All x-vectors used VAD-Energy from Kaldi SRE16 recipe. The systems were trained in Kaldi toolkit [14] using SRE16 recipe with modifications described below: • Using different feature sets • Training networks with 9 epochs (instead of 3). We did not see any considerable difference with 12 epochs. • Using modified example generation - we u...

[5]

Heavy-tailed PLDA Our i-vector system used HT-PLDA backend [16]

Backend 5.1. Heavy-tailed PLDA Our i-vector system used HT-PLDA backend [16]. It was trained on VoxCeleb 1 and 2 datasets. Training set consisted of 166 thousand audio files from 7146 speakers. Length normalization, centering, LDA, reducing dimensionality of vectors to 300, followed by another length normalization were applied to all i-vectors. All i...

[6]

Each system provided log-likelihood ratio scores that could be subjected to score normalization

Calibration & Fusion The submission strategy was one common fusion trained on the labeled VOiCES development data [20, 1]. Each system provided log-likelihood ratio scores that could be subjected to score normalization. These scores were first pre-calibrated and then passed into the fusion. The output of the fusion was then again re-calibrated. Both c...
[7]

The VOiCES from a Distance Challenge 2019 Evaluation Plan

Mahesh Kumar Nandwana, Julien van Hout, Mitch McLaren, Aaron Lawson, and María Auxiliadora Barrios, “The voices from a distance challenge 2019 evaluation plan,” in arXiv:1902.10828 [eess.AS], 2019

[8]

X-vectors: Robust dnn embeddings for speaker recognition

David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” Submitted to ICASSP, 2018
[9]

Analysis of dnn approaches to speaker identification

Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of dnn approaches to speaker identification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016. 2016, IEEE Signal Processing Society
[10]

Building and Evaluation of a Real Room Impulse Response Dataset

Igor Szöke, Miroslav Skácel, Ladislav Mošner, Jakub Paliesek, and Jan Černocký, “Building and Evaluation of a Real Room Impulse Response Dataset,” Under review for IEEE Journal of Selected Topics in Signal Processing, 2019
[11–12]

Dereverberation and Beamforming in Robust Far-Field Speaker Recognition

Ladislav Mošner, Oldřich Plchot, Pavel Matějka, Ondřej Novotný, and Jan Černocký, “Dereverberation and Beamforming in Robust Far-Field Speaker Recognition,” in Proceedings of Interspeech 2018, pp. 1334–1338, International Speech Communication Association
[13]

A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm

Kornel Laskowski and Jens Edlund, “A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010

[14]

A robust algorithm for pitch tracking (RAPT)

David Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis, W. B. Kleijn and K. Paliwal, Eds., New York, 1995, Elsevier

[15]

BUT 2014 Babel system: Analysis of adaptation in NN based systems

Martin Karafiát, František Grézl, Karel Veselý, Mirko Hannemann, Igor Szöke, and Jan Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Interspeech 2014, 2014, pp. 3002–3006

[16]

Neural network bottleneck features for language identification

Pavel Matějka et al., “Neural network bottleneck features for language identification,” in IEEE Odyssey: The Speaker and Language Recognition Workshop, Joensuu, Finland, 2014

[17]

A pitch extraction algorithm tuned for automatic speech recognition

P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014, pp. 2494–2498

[18]

Front-end factor analysis for speaker verification

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. PP, no. 99, pp. 1–1, 2010

[19]

Bayesian speaker verification with heavy-tailed priors

P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” keynote presentation, Proc. of Odyssey 2010, June 2010

[20]

Speech dereverberation based on variance-normalized delayed linear prediction

T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, Sep. 2010
[21]

The kaldi speech recognition toolkit

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

[22]

Speaker recognition for multi-speaker conversations using x-vectors

David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP, 2019

[23]

Fast variational bayes for heavy-tailed plda applied to i-vectors and x-vectors

Anna Silnova, Niko Brummer, Daniel Garcia-Romero, David Snyder, and Lukáš Burget, “Fast variational bayes for heavy-tailed plda applied to i-vectors and x-vectors,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 2018

[24]

Analysis of score normalization in multilingual speaker recognition

Pavel Matějka, Ondřej Novotný, Oldřich Plchot, Lukáš Burget, Mireia Sánchez Diez, and Jan Černocký, “Analysis of score normalization in multilingual speaker recognition,” in Proceedings of Interspeech 2017. 2017, pp. 1567–1571, International Speech Communication Association

[25]

Speaker adaptive cohort selection for tnorm in text-independent speaker verification

D. E. Sturim and Douglas A. Reynolds, “Speaker adaptive cohort selection for tnorm in text-independent speaker verification,” in ICASSP, 2005, pp. 741–744

[26]

How to deal with multiple-targets in speaker identification systems?

Yaniv Zigel and Moshe Wasserblat, “How to deal with multiple-targets in speaker identification systems?,” in Proceedings of the Speaker and Language Recognition Workshop (IEEE-Odyssey 2006), San Juan, Puerto Rico, June 2006

[27]

Voices obscured in complex environmental settings (VOICES) corpus

Colleen Richey, María Auxiliadora Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen R. Stauffer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices obscured in complex environmental settings (VOICES) corpus,” in ISCA INTERSPEECH 2018, 2018
