Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration

Kong Aik Lee; Takafumi Koshinaka; Tomi H. Kinnunen; Ville Vestman

arxiv: 1906.08556 · v1 · pith:XJAJWEFKnew · submitted 2019-06-20 · cs.LG · cs.SD · eess.AS · stat.ML

Ville Vestman, Kong Aik Lee, Tomi H. Kinnunen, Takafumi Koshinaka

Pith reviewed 2026-05-25 19:48 UTC · model grok-4.3

classification cs.LG · cs.SD · eess.AS · stat.ML

keywords i-vectors · speaker verification · GPU acceleration · universal background model · frame alignments · Kaldi · VoxCeleb · speaker embeddings

The pith

Updating the universal background model and recomputing frame alignments during i-vector training improves speaker verification results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that GPU acceleration speeds up i-vector computations enough to make iterative updates to the universal background model and frame alignments practical during extractor training. A reader would care because these steps produce measurable accuracy gains and expose that Kaldi's implementation includes extra steps yielding 1 to 2 percent better results than the textbook formulation on VoxCeleb tests. All claims rest on ensemble averages across multiple random-start runs to reduce noise from initialization.

Core claim

The central claim is that updating the universal background model and re-computing frame alignments while training the i-vector extractor is beneficial. Kaldi's i-vector extractor, which incorporates these and other undocumented steps, outperforms the standard formulation by a margin of 1 to 2 percent when evaluated with the VoxCeleb speaker verification protocol. The GPU speed-up of roughly 3000 times for frame posteriors and 25 times for extractor training makes such exploration feasible, and all reported gains are confirmed by averaging results from multiple runs that differ only in random initialization.

What carries the argument

The GPU-accelerated training loop for the i-vector extractor, which interleaves UBM updates and recomputation of frame alignments with the extractor updates.
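The frame alignment step in that loop amounts to computing, for every frame, the posterior probability of each UBM component. The following is a minimal CPU sketch of that computation, not the authors' GPU implementation. It assumes a diagonal-covariance GMM for simplicity, and the names `log_gauss_diag` and `frame_posteriors` as well as all numbers are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def frame_posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component given one frame x."""
    log_joint = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_joint)  # log-sum-exp trick for numerical stability
    log_norm = top + math.log(sum(math.exp(lj - top) for lj in log_joint))
    return [math.exp(lj - log_norm) for lj in log_joint]


# Toy two-component UBM over 2-dimensional features (made-up numbers).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

post_near_first = frame_posteriors([0.1, -0.2], weights, means, variances)
post_midway = frame_posteriors([2.0, 2.0], weights, means, variances)
```

On a GPU the same arithmetic would typically be expressed as one batched tensor operation over many frames at once, which is where a large speed-up could come from.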

If this is right

Kaldi's i-vector extractor shows a consistent 1-2 percent advantage over the standard formulation on VoxCeleb.
The GPU speed-up of 25 times in extractor training enables more exhaustive study of extractor variations than was previously practical.
Ensemble averaging across random starts confirms that the reported differences are stable rather than artifacts of a single initialization.
Frame posterior computation reaches 3000 times real-time speed, removing the main computational barrier to iterative alignment updates.
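The ensemble averaging mentioned above can be illustrated with a short sketch. It assumes each run yields a single error rate and that runs differ only in their random seed. The function name `summarize_runs` and the listed values are placeholders for illustration, not results from the paper.

```python
import math
import statistics


def summarize_runs(error_rates):
    """Return mean, sample standard deviation, and standard error of the mean."""
    n = len(error_rates)
    mean = statistics.mean(error_rates)
    std = statistics.stdev(error_rates) if n > 1 else 0.0
    sem = std / math.sqrt(n)
    return mean, std, sem


# Placeholder error rates (in percent) from five hypothetical random starts.
runs = [5.0, 5.2, 4.9, 5.1, 5.3]
mean, std, sem = summarize_runs(runs)
```

Reporting the standard error next to the mean makes it easier to judge whether a small margin between two systems exceeds the run-to-run noise.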

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same GPU-enabled loop could be applied to test whether x-vector training also benefits from background model updates.
If the gains hold across other datasets, the standard i-vector recipe used in many older systems may need revision.
The approach suggests checking whether recomputing alignments at every training epoch helps in domain-mismatched or noisy conditions not covered by VoxCeleb.

Load-bearing premise

The observed accuracy gains come from the UBM updates and alignment recomputations rather than from other unstated differences in code, data handling, or implementation details between the compared systems.

What would settle it

A side-by-side run of the standard i-vector formulation with and without the UBM update plus alignment recomputation steps, using identical code bases and data pipelines, to test whether the 1-2 percent margin appears only when those steps are added.

Figures

Figures reproduced from arXiv: 1906.08556 by Kong Aik Lee, Takafumi Koshinaka, Tomi H. Kinnunen, Ville Vestman.

**Figure 1.** An overview of the computational flow of frame alignment, i-vector extraction, and model training using a GPU. To keep the GPU memory requirements constant, fixed-size batches of frames and utterances are used for frame alignment and i-vector extraction, respectively.
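The fixed-size batching described in the caption can be sketched as follows. This is an illustrative assumption about how such batching might look, not the authors' code. The helper names `fixed_size_batches` and `accumulate_in_batches` are hypothetical, and the inner loop stands in for what would be a single batched tensor operation on a GPU.

```python
def fixed_size_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def accumulate_in_batches(frames, batch_size):
    """Sum frame vectors batch by batch so peak memory depends only on batch_size."""
    total = [0.0] * len(frames[0])
    for batch in fixed_size_batches(frames, batch_size):
        for frame in batch:  # stands in for one batched GPU operation
            for d, value in enumerate(frame):
                total[d] += value
    return total


# Seven toy 2-dimensional frames (made-up numbers).
frames = [[float(i), float(2 * i)] for i in range(7)]
batches = list(fixed_size_batches(list(range(7)), 3))
total_small = accumulate_in_batches(frames, batch_size=2)
total_large = accumulate_in_batches(frames, batch_size=5)
```

The last batch may be shorter than `batch_size`. The accumulated statistics do not depend on the batch size, so the batch size can be chosen purely to fit the available memory.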

**Figure 3.** Performance of the augmented formulation for varying intervals of frame alignment updates. The more often the alignments are updated, the faster the system performance improves. Each curve is obtained as an average of five runs with different random initializations.

Original abstract

Speaker embeddings are continuous-value vector representations that allow easy comparison between voices of speakers with simple geometric operations. Among others, i-vector and x-vector have emerged as the mainstream methods for speaker embedding. In this paper, we illustrate the use of modern computation platform to harness the benefit of GPU acceleration for i-vector extraction. In particular, we achieve an acceleration of 3000 times in frame posterior computation compared to real time and 25 times in training the i-vector extractor compared to the CPU baseline from Kaldi toolkit. This significant speed-up allows the exploration of ideas that were hitherto impossible. In particular, we show that it is beneficial to update the universal background model (UBM) and re-compute frame alignments while training the i-vector extractor. Additionally, we are able to study different variations of i-vector extractors more rigorously than before. In this process, we reveal some undocumented details of Kaldi's i-vector extractor and show that it outperforms the standard formulation by a margin of 1 to 2% when tested with VoxCeleb speaker verification protocol. All of our findings are asserted by ensemble averaging the results from multiple runs with random start.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GPU speed-up lets the authors update the UBM during i-vector training and report a 1-2% edge for Kaldi's extractor over the standard formulation, but the gains are not isolated from other implementation differences.

The main thing here is that GPU acceleration makes it feasible to update the UBM and recompute alignments while training the i-vector extractor, and the authors report this helps on VoxCeleb. They also surface some undocumented Kaldi details that give a 1-2% advantage over the textbook formulation. The speed numbers are the clearest win: 3000x faster posterior computation and 25x faster extractor training versus the Kaldi CPU baseline. Ensemble averaging across random starts is a sensible way to stabilize the numbers. Those two practical points are what the work actually adds beyond prior i-vector papers. The scope stays narrow to this one embedding method, so it does not touch wider ML practice.

The soft spot is the attribution of the 1-2% margin. The paper does not describe an ablation that holds initialization, data pipeline, T-matrix equations, scoring, and every hyperparameter fixed while toggling only the UBM update and alignment step. Without that isolation, other unstated code or data differences could explain part of the gap. The abstract also omits error bars, exact dataset sizes, and statistical tests, so the margin is only loosely quantified. All numbers are empirical on held-out data, with no circularity.

This is useful reading for people still working on speaker embeddings or maintaining i-vector pipelines. It will not reshape the field, but the engineering details and the training tweak are worth knowing inside the subfield. The empirical grounding is solid enough for referee time even if the comparisons need tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that GPU acceleration yields 3000x speedup in frame posterior computation and 25x in i-vector extractor training versus Kaldi's CPU baseline. It asserts that updating the UBM and recomputing frame alignments during extractor training is beneficial, reveals undocumented Kaldi implementation details, and shows Kaldi's extractor outperforming the standard formulation by 1-2% on VoxCeleb speaker verification, with all empirical claims supported by ensemble averaging over multiple random-start runs.

Significance. If the 1-2% gains can be isolated to the UBM update and alignment recomputation, the work would offer a practical improvement to i-vector systems and demonstrate the value of GPU-enabled experimentation. The explicit use of ensemble averaging over random starts is a methodological strength that adds robustness against initialization variance.

major comments (2)

[Experimental evaluation] Results on VoxCeleb: the central claim that Kaldi outperforms the standard formulation by 1-2% due to UBM update and alignment recomputation is not supported by an ablation that holds initialization, data pipeline, exact T-matrix update equations, scoring backend, and all hyperparameters fixed while toggling only those two operations. Without such isolation the attribution of the observed margin remains unsecured.
[Abstract and results sections] Performance margins of 1-2% are reported without error bars, exact dataset sizes for each experiment, or statistical significance tests, leaving the quantification imprecise even though ensemble averaging is employed.

minor comments (1)

The description of the standard i-vector formulation versus Kaldi's variant would benefit from an explicit side-by-side equation comparison to make the undocumented details immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Experimental evaluation] Results on VoxCeleb: the central claim that Kaldi outperforms the standard formulation by 1-2% due to UBM update and alignment recomputation is not supported by an ablation that holds initialization, data pipeline, exact T-matrix update equations, scoring backend, and all hyperparameters fixed while toggling only those two operations. Without such isolation the attribution of the observed margin remains unsecured.

Authors: We agree that a controlled ablation isolating only the UBM update and alignment recomputation (while fixing initialization, data pipeline, T-matrix equations, backend, and hyperparameters) would strengthen causal attribution of the 1-2% margin. Our experiments compare the full Kaldi pipeline against the standard formulation with ensemble averaging over random starts, but do not include the requested toggle-only ablation. We will add this ablation study to the revised manuscript. Revision: yes.

Referee: [Abstract and results sections] Performance margins of 1-2% are reported without error bars, exact dataset sizes for each experiment, or statistical significance tests, leaving the quantification imprecise even though ensemble averaging is employed.

Authors: Although ensemble averaging over multiple random-start runs was performed, we did not report error bars, exact per-experiment dataset sizes, or statistical significance tests. In revision we will add error bars computed from the ensembles, state the precise dataset sizes used, and include statistical significance tests (e.g., paired t-tests) to support the reported margins. Revision: yes.
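A paired t-test of the kind mentioned in the response could be set up roughly as below. This is a sketch under stated assumptions: runs of the two systems are paired by a shared random seed, the function name `paired_t_statistic` is hypothetical, and the error rates are placeholders rather than values from the paper. Only the t statistic and degrees of freedom are computed here. Turning them into a p-value requires a t-distribution table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(system_a, system_b):
    """Return (t statistic, degrees of freedom) for paired measurements.

    Assumes at least two pairs and that the differences are not all identical,
    otherwise the standard error is zero and the statistic is undefined.
    """
    diffs = [a - b for a, b in zip(system_a, system_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / sem, n - 1


# Placeholder error rates (percent) for five seeds shared by both systems.
without_updates = [5.2, 5.0, 5.3, 5.1, 5.4]
with_updates = [5.0, 4.9, 5.0, 5.0, 5.1]
t_stat, dof = paired_t_statistic(without_updates, with_updates)
```

Pairing by seed removes the variance that both systems share through a common initialization, which is why it can detect a small margin with only a few runs.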

Circularity Check

0 steps flagged

No circularity detected; purely empirical claims on held-out data

The paper reports GPU speedups and empirical gains (1-2% on VoxCeleb) from UBM update + alignment recomputation, supported by ensemble averaging over random starts. No mathematical derivation, fitted-parameter prediction, or self-citation chain is present; all assertions are direct measurements against external VoxCeleb protocol. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and computational; it introduces no new mathematical axioms, free parameters fitted inside the central claim, or postulated entities.

pith-pipeline@v0.9.0 · 5754 in / 1073 out tokens · 20832 ms · 2026-05-25T19:48:02.648730+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

[1] Introduction. A decade ago, the i-vector speaker embedding was introduced [1]. Since its introduction, it has remained as a standard solution for speaker recognition until recent years when it was excelled in many tasks by the deep neural network based embeddings [2, 3]. The recent developments are a result of the widespread interest among researchers ...

[2] I-vector speaker embeddings. We compare two different formulations of the total variability approach [5] of joint factor analysis [13] to extract i-vectors. In the total variability model, all of the variability in utterances is modeled using a single subspace only, without having separate subspaces to model speaker and channel effects. The first of the for...

[3] Training enhancements. The update step of the model training can have many variations. The most basic one is to only update matrices T_c, while also updating residual covariances Σ_c gives a slight improvement to the performance as we will demonstrate later. Another way to improve the model is to apply minimum divergence re-estimation to make the empiri...

[4] The computation of frame alignments and Baum-Welch statistics using the current UBM [12, 14].

[5] E-step: The computation of posterior means and covariances for the latent vectors using (3) and (4) to accumulate the required terms for the M-step.

[6] M-step: The update of matrices T_c followed by the update of residual covariances Σ_c [10].

[7] Minimum divergence re-estimation: The update of matrices T_c using the transforms P1 and P2 followed by the update of the prior offset p using (12).

[8] If not the last iteration, the update of the mean vectors of the UBM with the first columns of matrices T_c^upd multiplied by p. After the model has been trained, the updated UBM is used in the testing phase to compute the frame posteriors.
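The order of the training steps quoted in entries [4] to [8] can be summarized in a short sketch. It is an illustration only: the step names, the functions `iteration_schedule` and `update_ubm_means`, the list-of-rows matrix layout, and the treatment of the prior offset p as a scalar are all assumptions made here, and less frequent alignment updates are not modeled.

```python
def iteration_schedule(num_iterations):
    """List the step names run in each training iteration, in order."""
    schedule = []
    for it in range(num_iterations):
        steps = ["align_and_stats", "e_step", "m_step", "min_divergence"]
        if it < num_iterations - 1:  # the UBM update is skipped on the last iteration
            steps.append("update_ubm_means")
        schedule.append(steps)
    return schedule


def update_ubm_means(t_matrices, prior_offset):
    """Return one mean vector per component: first column of T_c times p.

    t_matrices: list of T_c, each stored as a list of rows
                (feature dimension x i-vector dimension).
    prior_offset: the prior offset p, assumed to be a scalar in this sketch.
    """
    return [[row[0] * prior_offset for row in t_c] for t_c in t_matrices]


# Toy example: one component, 2-dim features, 2-dim i-vectors (made-up numbers).
t_matrices = [[[1.0, 9.0], [2.0, 9.0]]]
new_means = update_ubm_means(t_matrices, prior_offset=3.0)
schedule = iteration_schedule(2)
```

Because the updated means feed the next alignment pass, the frame posteriors of the following iteration are computed against a UBM that has moved with the extractor.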
[9] Experiments. 4.1. Experimentation setup. We built the acoustic front-end of our systems on the basis of Kaldi [11] i-vector recipe for VoxCeleb [22, 23]. That is, we relied on Kaldi to extract MFCCs, to perform voice activity detection (VAD), and to train the UBM. We used the same settings as in the Kaldi recipe: The MFCC vectors are 72-dimensional incl...

[10] Discussion and conclusions. We have a couple of remarks from the practical aspect of the study. First, we found that by using the modern deep learning platforms, such as PyTorch, the implementation of GPU accelerated algorithms for generative models is almost as straightforward as it is with their non-GPU counterparts (e.g. NumPy). The only concern is ...

[11] Acknowledgements. This work was partially supported by Academy of Finland (proj. #309629) and by the Doctoral Programme in Science, Technology and Computing (SCITECO) of the UEF. The authors at UEF were also supported by NVIDIA Corporation with the donation of Titan V GPU.
[12] N. Dehak, “Discriminative and generative approaches for long- and short-term speaker characteristics modeling: application to speaker verification,” Ph.D. dissertation, École de technologie supérieure, 2009.

[13] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.

[14] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken language recognition using x-vectors,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 105–111. [Online]. Available: http://dx.doi.org/10.21437/Odyssey.2018-15

[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[16] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[17] O. Glembek, L. Burget, P. Matějka, M. Karafiát, and P. Kenny, “Simplification and optimization of i-vector extraction,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 4516–4519.

[18] S. Cumani and P. Laface, “Factorized sub-space estimation for fast and memory effective i-vector extraction,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 248–259, 2014.

[19] L. Xu, K. A. Lee, H. Li, and Z. Yang, “Generalizing i-vector estimation for rapid speaker recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 4, pp. 749–759, 2018.

[20] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” CRIM, Montreal, (Report) CRIM-06/08-13, vol. 14, pp. 28–29, 2005.

[21] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with sparse training data,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345–354, 2005.

[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011.

[23] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[24] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.

[25] P. Kenny, “A small footprint i-vector extractor,” in Odyssey, vol. 2012, 2012, pp. 1–6.

[26] S. Madikeri, S. Dey, P. Motlicek, and M. Ferras, “Implementation of the standard i-vector system for the kaldi speech recognition toolkit,” Idiap, Tech. Rep., 2016.

[27] A. Larcher, K. A. Lee, and S. Meignier, “An extensible speaker identification SIDEKIT in Python,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5095–5099.

[28] S. O. Sadjadi, M. Slaney, and L. P. Heck, “MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research,” 2013.

[29] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow et al., “The subspace Gaussian mixture model—A structured model for speech recognition,” Computer Speech & Language, vol. 25, no. 2, pp. 404–439, 2011.

[30] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.

[31] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker adaptation using an eigenphone basis,” IEEE transactions on speech and audio processing, vol. 12, no. 6, pp. 579–589, 2004.

[32] A. S. Householder, “Unitary triangularization of a nonsymmetric matrix,” Journal of the ACM (JACM), vol. 5, no. 4, pp. 339–342, 1958.

[33] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” Proc. Interspeech 2017, pp. 2616–2620, 2017.

[34] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018.

[35] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.

[37] E. Jones, T. Oliphant, P. Peterson et al., “SciPy: Open source scientific tools for Python,” 2001–. [Online]. Available: http://www.scipy.org/

[38] D. Can, V. R. Martinez, P. Papadopoulos, and S. S. Narayanan, “Pykaldi: A Python wrapper for Kaldi,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[39] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
