A Deep Neural Network for Short-Segment Speaker Recognition

Ali Etemad; Amirhossein Hajavi

arXiv: 1907.10420 · v1 · pith:WTG3TJXJnew · submitted 2019-07-22 · eess.AS · cs.LG · cs.SD · stat.ML

Pith reviewed 2026-05-24 17:19 UTC · model grok-4.3

keywords: speaker recognition, short utterances, deep neural network, VoxCeleb, voice biometrics, speech processing, short-segment recognition

The pith

UtterIdNet is a deep neural network built for speaker recognition on speech segments down to 250 milliseconds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UtterIdNet, a new deep neural network for identifying speakers from very brief voice clips. Its design focuses on making fuller use of the limited information available in short utterances. When trained and tested on the VoxCeleb datasets, the model delivers stable results across segment lengths of 2 seconds, 1 second, 500 ms, and 250 ms, with the largest gains appearing at the shortest durations. Interactive devices such as phones and smart speakers commonly receive only brief audio, so reliable recognition at these lengths would remove the need for longer prompts.

Core claim

UtterIdNet employs a novel architecture that increases the efficient use of information contained in short speech segments. Evaluations on VoxCeleb show consistent performance for segments of 2 seconds, 1 second, and especially sub-second lengths (500 ms and 250 ms), producing significant gains over earlier models at those shorter durations.

What carries the argument

UtterIdNet's novel architecture, which extracts and uses information from short speech segments more efficiently than prior designs.

If this is right

Speaker recognition becomes practical for the brief utterances typical of voice assistants.
Performance stays stable when input length drops below one second.
Gains are largest at 250 ms and 500 ms compared with earlier methods.
Models can be deployed directly on devices that capture only short voice commands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture might improve other short-audio tasks such as command classification or emotion detection.
On-device implementations could reduce the need for cloud offloading of brief voice samples.
Future work could test whether the gains hold when training data is limited to short segments only.

Load-bearing premise

The new architecture extracts information from short speech segments more efficiently than earlier networks.

What would settle it

Run UtterIdNet and the previous best models on a held-out set of 250 ms VoxCeleb segments and observe no accuracy improvement for UtterIdNet.
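As a rough illustration of how such a held-out test set could be prepared, the sketch below cuts a waveform into fixed-length segments at the four durations discussed here. The function name `crop_segments`, the 16 kHz sample rate, and the non-overlapping hop are assumptions made for this example; the paper's own segment-extraction procedure is not described in the material above.

```python
import numpy as np

def crop_segments(waveform, sample_rate, duration_ms, hop_ms=None):
    """Split a 1-D waveform into fixed-length segments.

    Returns an array of shape (num_segments, segment_len). Trailing samples
    that do not fill a whole segment are dropped. With hop_ms=None the
    segments do not overlap.
    """
    waveform = np.asarray(waveform)
    seg_len = int(round(sample_rate * duration_ms / 1000))
    hop = seg_len if hop_ms is None else int(round(sample_rate * hop_ms / 1000))
    if seg_len <= 0 or hop <= 0:
        raise ValueError("duration_ms and hop_ms must be positive")
    if len(waveform) < seg_len:
        # Input is shorter than one segment: return an empty batch.
        return np.empty((0, seg_len), dtype=waveform.dtype)
    starts = np.arange(0, len(waveform) - seg_len + 1, hop)
    return np.stack([waveform[s:s + seg_len] for s in starts])

# Hypothetical input: 3 s of synthetic noise at an assumed 16 kHz sample rate.
sr = 16000
audio = np.random.default_rng(0).standard_normal(3 * sr).astype(np.float32)
for ms in (2000, 1000, 500, 250):
    segs = crop_segments(audio, sr, ms)
    print(f"{ms} ms -> {segs.shape[0]} segments of {segs.shape[1]} samples")
```

Scoring every model on the same cropped segments keeps the comparison at each duration like-for-like.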

Figures

Figures reproduced from arXiv: 1907.10420 by Ali Etemad, Amirhossein Hajavi.

**Figure 1.** Architecture of our proposed model, UtterIdNet: (a) the overall scheme, followed by (b) the internal architecture of ID Blocks.

Original abstract

Today's interactive devices such as smart-phone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better served by models capable of performing the recognition task with short-duration utterances. In this paper, a new deep neural network, UtterIdNet, capable of performing speaker recognition with short speech segments is proposed. Our proposed model utilizes a novel architecture that makes it suitable for short-segment speaker recognition through an efficiently increased use of information in short speech segments. UtterIdNet has been trained and tested on the VoxCeleb datasets, the latest benchmarks in speaker recognition. Evaluations for different segment durations show consistent and stable performance for short segments, with significant improvement over the previous models for segments of 2 seconds, 1 second, and especially sub-second durations (250 ms and 500 ms).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract announces UtterIdNet with claimed gains on sub-second VoxCeleb segments but supplies no architecture, numbers, or baselines, so the contribution cannot be assessed.

The one thing to know is that this paper names a new network, UtterIdNet, and asserts it handles short utterances better than prior work, especially at 250 ms and 500 ms on VoxCeleb. The practical setting is real: consumer devices often receive brief commands, so models that stay stable there would matter for on-device use. The focus on that regime and the claim of more efficient information use in short segments are the main points offered.

Beyond the name and the headline assertion, the abstract gives nothing concrete. No equations, no diagram, no error rates, no baseline descriptions, and no mention of how segments were extracted or how training was done. Without those, it is impossible to tell whether the architecture differs from existing CNN or embedding methods already used in speaker recognition or whether the reported improvements are genuine. The work is empirical, so the absence of any quantitative evidence or statistical detail is the central limitation. The full manuscript might contain tables and architecture details that change the picture, but nothing in the provided material allows verification.

This paper would interest people building short-utterance speaker systems for devices. A reader gets value only if the missing experimental section and model description are present and hold up. I would not bring it to a reading group as is. It does not look ready for peer review without the actual results and method details; the current version is too thin to justify referee time.

Referee Report

1 major / 0 minor

Summary. The paper proposes UtterIdNet, a deep neural network with a novel architecture for speaker recognition on short-duration utterances. The model is trained and evaluated on the VoxCeleb datasets and is claimed to deliver consistent, stable performance down to 250 ms segments with significant gains over prior models at 2 s, 1 s, and sub-second lengths.

Significance. If the reported gains on short segments are robust and properly controlled, the work would address a practical gap in speaker recognition for interactive devices. The empirical focus on sub-second performance is directly relevant to real-world constraints.

major comments (1)

[Abstract] The abstract states 'significant improvement' for 250 ms and 500 ms segments yet provides no numerical EER values, confidence intervals, baseline descriptions, or statistical tests. The central empirical claim cannot be assessed without these quantities in the results section.
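For context on the quantity the referee asks for: the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is one simple way to approximate it from verification scores. It is not taken from the paper; the function name `equal_error_rate` and the convention that a higher score means "same speaker" are assumptions for this example.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and binary trial labels.

    labels: 1/True for same-speaker (target) trials, 0/False otherwise.
    A trial is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("need both target and non-target trials")
    thresholds = np.unique(scores)
    # False acceptance rate: non-target trials that are accepted.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    # With finitely many trials the two curves rarely cross exactly, so
    # take the threshold where they are closest and average the two rates.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy trial list for illustration only (not data from the paper).
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

Reporting this value per segment duration, for UtterIdNet and for each baseline, would give the abstract's claim a checkable form.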

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify our work. We address the major comment below.

Referee: [Abstract] The abstract states 'significant improvement' for 250 ms and 500 ms segments yet provides no numerical EER values, confidence intervals, baseline descriptions, or statistical tests. The central empirical claim cannot be assessed without these quantities in the results section.

Authors: We agree that the abstract would benefit from explicit numerical support. In the revised manuscript we have updated the abstract to report the EER values achieved by UtterIdNet at 250 ms and 500 ms together with the corresponding EERs of the strongest published baselines on the same VoxCeleb evaluation protocol. The results section already contains the full set of EER tables (Tables 2–4) that compare UtterIdNet against prior models at every tested duration, including the sub-second lengths; these tables supply the baseline descriptions requested. Confidence intervals and formal statistical tests are not standard practice in the speaker-recognition literature when reporting EER on fixed, publicly released test sets such as VoxCeleb; if the referee considers them essential we are prepared to add bootstrap intervals in a further revision. revision: yes
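The bootstrap intervals mentioned in the response could be computed roughly as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: it uses a percentile bootstrap, resamples target and non-target trials separately, and runs on synthetic scores. The names `eer` and `bootstrap_eer_ci` are hypothetical.

```python
import numpy as np

def eer(scores, labels):
    """Approximate EER; labels are True for same-speaker trials."""
    thresholds = np.unique(scores)
    far = (scores[~labels][None, :] >= thresholds[:, None]).mean(axis=1)
    frr = (scores[labels][None, :] < thresholds[:, None]).mean(axis=1)
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def bootstrap_eer_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER.

    Target and non-target trials are resampled with replacement in
    separate pools, so every replicate contains both classes.
    Returns (point_estimate, lower, upper).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    rng = np.random.default_rng(seed)
    tgt, non = scores[labels], scores[~labels]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(tgt, size=tgt.size, replace=True)
        n = rng.choice(non, size=non.size, replace=True)
        s = np.concatenate([t, n])
        l = np.concatenate([np.ones(t.size, bool), np.zeros(n.size, bool)])
        stats[b] = eer(s, l)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return eer(scores, labels), lo, hi

# Synthetic scores for illustration only (not results from the paper).
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500, bool), np.zeros(500, bool)])
point, lo, hi = bootstrap_eer_ci(scores, labels, n_boot=200)
print(f"EER = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

One caveat on the design: verification trials that share a speaker or an utterance are not independent, so resampling individual trials may understate the true variability. Resampling whole speakers would be a more conservative choice.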

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with direct falsifiability

The paper proposes UtterIdNet, a DNN architecture for short-segment speaker recognition, and reports empirical results on VoxCeleb for segments down to 250 ms. No derivation chain, equations, or predictions exist that could reduce to inputs by construction. The central claim rests on reported performance metrics that are externally falsifiable via the stated experiments; no self-citation load-bearing, fitted-input-as-prediction, or ansatz smuggling is present. This is a standard empirical ML paper whose results stand or fall on the data splits and metrics, not internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available. The central claim rests on an unspecified novel architecture whose details are not provided, plus the assumption that VoxCeleb test segments are representative of real device usage. No free parameters, axioms, or invented entities can be enumerated from the given text.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 5 internal anchors

[1] Introduction. Speaker recognition has seen profound improvements due to the recent advancements in deep learning. Accordingly, accuracy levels of proposed deep neural networks (DNN) for speaker recognition (both verification and identification) are far surpassing previous state-of-the-art techniques. Recent examples include the use of embeddings obtained...
[2] Related Work. DNNs were first used for speaker recognition in [12], and have since shown promising results by outperforming the traditional HMM-GMM techniques. Since the introduction of DNNs, attempts have been made to incorporate such techniques with I-Vector methods in [13, 17]. Later on, further attempts were made to tackle speaker recognition with DNN unde...
[3] utilizing novel aggregation methods such as GhostVlad and NetVlad, outperformed the previous use of a 50 layer ResNet model [3]
[4] Method. In this section, the feature extraction process and our proposed model are described. 3.1. Feature extraction. To extract informative frequency features from the short-segment utterances, Short-Time Fourier Transform (STFT) is used in this paper. STFT, as shown in Equation 1, is the Fourier transform of the signal under a time window of τ. In Equa- ...
[5] Experiments and Results. In this section, we describe the datasets and the experimental setup utilized for speaker verification. Next, the results obtained by our model using different short segment durations are presented, and compared to the state-of-the-art [22]. Furthermore, in addition to short-segment windows, the performance of the proposed model o...
[6] Conclusions. In this paper a new DNN, UtterIdNet, was proposed with the aim of an efficient increase of information use for short speech segments. We evaluated our method using the VoxCeleb datasets, and demonstrated that UtterIdNet outperforms the state-of-the-art for short segments. Specifically, we utilized 250 ms, 500 ms, 1 sec, and 2 sec segments, wh...
[7] Acknowledgements. The authors would like to thank IMRSV Data Labs for their support of this work. The authors would also like to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research (grant number: CRDPJ 533919-18)
[8] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” INTERSPEECH, pp. 2616–2620, 2017
[9] W. Cai, J. Chen, and M. Li, “Analysis of Length Normalization in End-to-End Speaker Verification System,” INTERSPEECH, pp. 3618–3622, 2018
[10] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” INTERSPEECH, pp. 1086–1090, 2018
[11] Z. Zhang, L. Wang, A. Kai, T. Yamada, W. Li, and M. Iwahashi, “Deep Neural Network-based Bottleneck Feature and Denoising Autoencoder-based Dereverberation for Distant-talking Speaker Identification,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 12, 2015
[12] S. S. Tirumala and S. R. Shahamiri, “A Deep Autoencoder Approach for Speaker Identification,” the 9th International Conference on Signal Processing Systems, pp. 175–179, 2017
[13] S. Novoselov, V. Shchemelinin, A. Shulipa, A. Kozlov, and I. Kremnev, “Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition,” INTERSPEECH, pp. 2242–2246, 2018
[14] S. Yadav and A. Rai, “Learning Discriminative Features for Speaker Identification and Verification,” INTERSPEECH, pp. 2237–2241, 2018
[15] D. A. Reynolds, “An Overview of Automatic Speaker Recognition Technology,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 4072–4075, 2002
[16] P. Kenny, “Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms,” CRIM, Montreal, (Report) CRIM-06/08-13, vol. 14, pp. 28–29, 2005
[17] R. Togneri and D. Pullella, “An Overview of Speaker Identification: Accuracy and Robustness Issues,” IEEE Circuits and Systems Magazine, vol. 11, no. 2, pp. 23–61, 2011
[18] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speakers in the Wild SITW Speaker Recognition Database,” INTERSPEECH, pp. 818–822, 2016
[19] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012
[20] Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, “I-vector Representation Based on Bottleneck Features for Language Identification,” Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013
[21] T. Yamada, L. Wang, and A. Kai, “Improvement of Distant-talking Speaker Identification Using Bottleneck Features of DNN,” INTERSPEECH, pp. 3661–3664, 2013
[22] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A Novel Scheme for Speaker Recognition Using a Phonetically-aware Deep Neural Network,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699, 2014
[23] Y. Lei, L. Ferrer, A. Lawson, M. McLaren, and N. Scheffer, “Application of Convolutional Neural Networks to Language Identification in Noisy Conditions,” Odyssey, vol. 41, pp. 1–8, 2014
[24] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep Neural Networks for Extracting Baum-Welch Statistics for Speaker Recognition,” Odyssey, pp. 293–298, 2014
[25] P. Matejka, L. Zhang, T. Ng, H. S. Mallidi, O. Glembek, J. Ma, and B. Zhang, “Neural Network Bottleneck Features for Language Identification,” Odyssey, vol. 2014, pp. 299–304, 2014
[26] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic Language Identification Using Deep Neural Networks,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5337–5341, 2014
[27] O. Ghahabi and J. Hernando, “Deep Learning Backend for Single and Multi-session i-Vector Speaker Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807–817, 2017
[28] S. Shon, H. Tang, and J. Glass, “Frame-level Speaker Embeddings for Text-independent Speaker Recognition and Analysis of End-to-end Model,” arXiv preprint arXiv:1809.04437, 2018
[29] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level Aggregation For Speaker Recognition In The Wild,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
[30] H. Park, S. Cho, K. Park, N. Kim, and J. Park, “Training Utterance-level Embedding Networks for Speaker Identification and Verification,” INTERSPEECH, pp. 3563–3567, 2018
[31] J.-w. Jung, H.-s. Heo, I.-h. Yang, H.-j. Shim, and H.-j. Yu, “Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification,” INTERSPEECH, pp. 3583–3587, 2018
[32] J. Wang, K.-C. Wang, M. Law, F. Rudzicz, and M. Brudno, “Centroid-based Deep Metric Learning for Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
[33] R. Kumar, V. Yeruva, and S. Ganapathy, “On Convolutional LSTM Modeling for Joint Wake-Word Detection and Text Dependent Speaker Verification,” INTERSPEECH, pp. 1121–1125, 2018
[34] P. S. Nidadavolu, J. Villalba, and N. Dehak, “Cycle-GANs for Domain Adaptation of Acoustic Features for Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6206–6210, 2019
[35] J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, and O. Plchot, “Speaker Verification Using End-to-end Adversarial Language Adaptation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6006–6010, 2019
[36] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive Statistics Pooling for Deep Speaker Embedding,” INTERSPEECH, pp. 2252–2256, 2018
[37] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel Voice Conversion Using i-Vector PLDA: Towards Unifying Speaker Verification and Transformation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5535–5539, 2017
[38] A. Jati and P. Georgiou, “An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks,” INTERSPEECH, pp. 1131–1135, 2018
[39] W. Ding and L. He, “MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks,” INTERSPEECH, pp. 3633–3637, 2018
[40] X. Fang, L. Zou, J. Li, L. Sun, and Z.-H. Ling, “Channel Adversarial Training for Cross-channel Text-independent Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
[41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016
[42] Y. Sun, D. Liang, X. Wang, and X. Tang, “DeepID3: Face Recognition with Very Deep Neural Networks,” arXiv:1502.00873 [cs], 2015
[43] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2014
[44] M. Hajibabaei and D. Dai, “Unified Hypersphere Embedding for Speaker Recognition,” arXiv preprint arXiv:1807.08312, 2018

[1] [1]

Introduction Speaker recognition has seen profound improvements due to the recent advancements in deep learning. Accordingly, accuracy levels of proposed deep neural networks (DNN) for speaker recognition (both veriﬁcation and identiﬁcation) are far surpass- ing previous state-of-the-art techniques. Recent examples in- clude the use of embeddings obtained...

work page

[2] [2]

A Deep Neural Network for Short-Segment Speaker Recognition

Related Work DNNs were ﬁrst used for speaker recognition in [12], and have since shown promising results by outperforming the traditional HMM-GMM techniques. Since the introduction of DNNs, at- tempts have made to incorporate such techniques with I-Vector methods in [13, 17]. Later on, further attempts were made to tackle speaker recognition with DNN unde...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

utilizing novel aggregation methods such as GhostVlad and NetVlad, outperformed the previous use of a 50 layer ResNet model [3]

work page

[4] [4]

Feature extraction To extract informative frequency features from the short- segment utterances, Short-Time Fourier Transform (STFT) is used in this paper

Method In this section, the feature extraction process and our proposed model are described: 3.1. Feature extraction To extract informative frequency features from the short- segment utterances, Short-Time Fourier Transform (STFT) is used in this paper. STFT, as shown in Equation 1 is the Fourier transform of the signal under a time window of τ. In Equa- ...

work page

[5] [5]

Next, the results obtained by our model using different short segment durations are pre- sented, and compared to the state-of-the-art [22]

Experiments and Results In this section, we describe the datasets and the experimental setup utilized for speaker veriﬁcation. Next, the results obtained by our model using different short segment durations are pre- sented, and compared to the state-of-the-art [22]. Furthermore, in addition to short-segment windows, the performance of the proposed model o...

work page

[6] [6]

We evaluated our method using the V ox- Celeb datasets, and demonstrated that UtterIdNet outperforms the state-of-the-art for short segments

Conclusions In this paper a new DNN, UtterIdNet, was proposed with the aim of an efﬁcient increase of information use for short speech segments. We evaluated our method using the V ox- Celeb datasets, and demonstrated that UtterIdNet outperforms the state-of-the-art for short segments. Speciﬁcally, we utilized 250 ms, 500 ms, 1 sec, and 2 sec segments, wh...

work page

[7] [7]

The authors would also like to acknowl- edge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research (grant num- ber: CRDPJ 533919-18)

Acknowledgements The authors would like to thank IMRSV Data Labs for their support of this work. The authors would also like to acknowl- edge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research (grant num- ber: CRDPJ 533919-18)

work page

[8] [8]

V oxCeleb: A Large- Scale Speaker Identiﬁcation Dataset,

A. Nagrani, J. S. Chung, and A. Zisserman, “V oxCeleb: A Large- Scale Speaker Identiﬁcation Dataset,”INTERSPEECH, pp. 2616– 2620, 2017

work page 2017

[9] [9]

Analysis of Length Normalization in End-to-End Speaker Veriﬁcation System,

W. Cai, J. Chen, and M. Li, “Analysis of Length Normalization in End-to-End Speaker Veriﬁcation System,”INTERSPEECH, pp. 3618–3622, 2018

work page 2018

[10] [10]

V oxceleb2: Deep speaker recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxceleb2: Deep speaker recognition,” INTERSPEECH, pp. 1086–1090, 2018

work page 2018

[11] [11]

Deep Neural Network-based Bottleneck Feature and Denoising Autoencoder-based Dereverberation for Distant-talking Speaker Identiﬁcation,

Z. Zhang, L. Wang, A. Kai, T. Yamada, W. Li, and M. Iwahashi, “Deep Neural Network-based Bottleneck Feature and Denoising Autoencoder-based Dereverberation for Distant-talking Speaker Identiﬁcation,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 12, 2015

work page 2015

[12] [12]

A Deep Autoencoder Ap- proach for Speaker Identiﬁcation,

S. S. Tirumala and S. R. Shahamiri, “A Deep Autoencoder Ap- proach for Speaker Identiﬁcation,” the 9th International Confer- ence on Signal Processing Systems , pp. 175–179, 2017

work page 2017

[13] [13]

Triplet Loss Based Cosine Similarity Metric Learn- ing for Text-independent Speaker Recognition,

S. Novoselov, V . Shchemelinin, A. Shulipa, A. Kozlov, and I. Kremnev, “Triplet Loss Based Cosine Similarity Metric Learn- ing for Text-independent Speaker Recognition,” INTERSPEECH, pp. 2242–2246, 2018

work page 2018

[14] [14]

Learning Discriminative Features for Speaker Identiﬁcation and Veriﬁcation,

S. Yadav and A. Rai, “Learning Discriminative Features for Speaker Identiﬁcation and Veriﬁcation,” INTERSPEECH, pp. 2237–2241, 2018

work page 2018

[15] [15]

An Overview of Automatic Speaker Recogni- tion Technology,

D. A. Reynolds, “An Overview of Automatic Speaker Recogni- tion Technology,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , vol. 4, pp. 4072–4075, 2002

work page 2002

[16] [16]

Joint Factor Analysis of Speaker and Session Variabil- ity: Theory and Algorithms,

P. Kenny, “Joint Factor Analysis of Speaker and Session Variabil- ity: Theory and Algorithms,” CRIM, Montreal,(Report) CRIM- 06/08-13, vol. 14, pp. 28–29, 2005

work page 2005

[17] [17]

An Overview of Speaker Identiﬁca- tion: Accuracy and Robustness Issues,

R. Togneri and D. Pullella, “An Overview of Speaker Identiﬁca- tion: Accuracy and Robustness Issues,” IEEE Circuits and Sys- tems Magazine, vol. 11, no. 2, pp. 23–61, 2011

work page 2011

[18] [18]

The Speak- ers in the Wild SITW Speaker Recognition Database,

M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The Speak- ers in the Wild SITW Speaker Recognition Database,” INTER- SPEECH, pp. 818–822, 2016

work page 2016

[19] [19]

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V . Vanhoucke, P. Nguyen, and T. N. Sainath, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Transactions on Signal Processing, vol. 29, no. 6, pp. 82–97, 2012

work page 2012

[20] [20]

I-vector Repre- sentation Based on Bottleneck Features for Language Identiﬁca- tion,

Y . Song, B. Jiang, Y . Bao, S. Wei, and L.-R. Dai, “I-vector Repre- sentation Based on Bottleneck Features for Language Identiﬁca- tion,” Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013

work page 2013

[21] [21]

Improvement of Distant- talking Speaker Identiﬁcation Using Bottleneck Features of DNN,

T. Yamada, L. Wang, and A. Kai, “Improvement of Distant- talking Speaker Identiﬁcation Using Bottleneck Features of DNN,” INTERSPEECH, pp. 3661–3664, 2013

work page 2013

[22] [22]

A Novel Scheme for Speaker Recognition Using a Phonetically-aware Deep Neural Network,

Y . Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A Novel Scheme for Speaker Recognition Using a Phonetically-aware Deep Neural Network,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699, 2014

work page 2014

[23] [23]

Ap- plication of Convolutional Neural Networks to Language Identiﬁ- cation in Noisy Conditions,

Y . Lei, L. Ferrer, A. Lawson, M. McLaren, and N. Scheffer, “Ap- plication of Convolutional Neural Networks to Language Identiﬁ- cation in Noisy Conditions,” Odyssey, vol. 41, pp. 1–8, 2014

work page 2014

[24] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep Neural Networks for Extracting Baum-Welch Statistics for Speaker Recognition,” Odyssey, pp. 293–298, 2014

[25] P. Matejka, L. Zhang, T. Ng, H. S. Mallidi, O. Glembek, J. Ma, and B. Zhang, “Neural Network Bottleneck Features for Language Identification,” Odyssey, vol. 2014, pp. 299–304, 2014

[26] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic Language Identification Using Deep Neural Networks,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5337–5341, 2014

[27] O. Ghahabi and J. Hernando, “Deep Learning Backend for Single and Multi-session i-Vector Speaker Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807–817, 2017

[28] S. Shon, H. Tang, and J. Glass, “Frame-level Speaker Embeddings for Text-independent Speaker Recognition and Analysis of end-to-end model,” arXiv preprint arXiv:1809.04437, 2018

[29] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level Aggregation For Speaker Recognition In The Wild,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

[30] H. Park, S. Cho, K. Park, N. Kim, and J. Park, “Training Utterance-level Embedding Networks for Speaker Identification and Verification,” INTERSPEECH, pp. 3563–3567, 2018

[31] J.-w. Jung, H.-s. Heo, I.-h. Yang, H.-j. Shim, and H.-j. Yu, “Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification,” INTERSPEECH, pp. 3583–3587, 2018

[32] J. Wang, K.-C. Wang, M. Law, F. Rudzicz, and M. Brudno, “Centroid-based Deep Metric Learning for Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

[33] R. Kumar, V. Yeruva, and S. Ganapathy, “On Convolutional LSTM Modeling for Joint Wake-Word Detection and Text Dependent Speaker Verification,” INTERSPEECH, pp. 1121–1125, 2018

[34] P. S. Nidadavolu, J. Villalba, and N. Dehak, “Cycle-GANs for Domain Adaptation of Acoustic Features for Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6206–6210, 2019

[35] J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, and O. Plchot, “Speaker Verification Using End-to-end Adversarial Language Adaptation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6006–6010, 2019

[36] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive Statistics Pooling for Deep Speaker Embedding,” INTERSPEECH, pp. 2252–2256, 2018

[37] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel Voice Conversion Using i-Vector PLDA: Towards Unifying Speaker Verification and Transformation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5535–5539, 2017

[38] A. Jati and P. Georgiou, “An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks,” INTERSPEECH, pp. 1131–1135, 2018

[39] W. Ding and L. He, “MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks,” INTERSPEECH, pp. 3633–3637, 2018

[40] X. Fang, L. Zou, J. Li, L. Sun, and Z.-H. Ling, “Channel Adversarial Training for Cross-channel Text-independent Speaker Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

[41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016

[42] Y. Sun, D. Liang, X. Wang, and X. Tang, “DeepID3: Face Recognition with Very Deep Neural Networks,” arXiv preprint arXiv:1502.00873, 2015

[43] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2014

[44] M. Hajibabaei and D. Dai, “Unified Hypersphere Embedding for Speaker Recognition,” arXiv preprint arXiv:1807.08312, 2018