Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides the basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. As for the loss function in open-set speaker verification, center loss and angular softmax loss are introduced in the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layer and loss function.
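The role of the encoding layer described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the attention form used here (a tanh projection scored against a context vector, then a softmax over time) is an assumption based on a common formulation of self-attentive pooling, and the names `W`, `b`, `u`, and the dimensions are hypothetical. Parameters are random here, whereas in a real system they would be learned.

```python
import numpy as np

def temporal_average_pooling(frames):
    """Average frame-level features over time: (T, D) -> (D,)."""
    return frames.mean(axis=0)

def self_attentive_pooling(frames, W, b, u):
    """Attention-weighted average over time: (T, D) -> (D,).

    Assumed form: h_t = tanh(x_t W + b), score_t = h_t . u,
    weights = softmax(scores) over the time axis.
    """
    hidden = np.tanh(frames @ W + b)          # (T, H)
    scores = hidden @ u                       # (T,)
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames, weights

rng = np.random.default_rng(0)
D, H = 8, 4                                   # hypothetical feature / attention sizes
W = rng.normal(size=(D, H))
b = np.zeros(H)
u = rng.normal(size=H)

# Two utterances of different lengths map to vectors of the same size.
short_utt = rng.normal(size=(50, D))
long_utt = rng.normal(size=(300, D))

tap_short = temporal_average_pooling(short_utt)
tap_long = temporal_average_pooling(long_utt)
sap_short, w_short = self_attentive_pooling(short_utt, W, b, u)
sap_long, w_long = self_attentive_pooling(long_utt, W, b, u)

# With a zero context vector every frame gets equal weight, so the
# attention pooling reduces to temporal average pooling.
sap_uniform, _ = self_attentive_pooling(short_utt, W, b, np.zeros(H))
```

Both functions return a fixed-size vector regardless of the number of input frames, which is what lets the network accept variable-length input. The difference is that the attention variant can weight informative frames more heavily than uninformative ones.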
Forward citations
Cited by 2 Pith papers
- Explainable AI in Speaker Recognition -- Making Latent Representations Understandable
  Speaker recognition networks form hierarchical clusters in latent space that can be matched to semantic classes using the new HCCM algorithm and quantified by Liebig's score.
- Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors
  Digit-specific HMM i-vectors with uncertainty normalization reach 1.52% male and 1.77% female EER on RSR2015 part III using only that corpus and simple cosine scoring.