pith. sign in

arxiv: 1904.08104 · v2 · pith:A2US2WVCnew · submitted 2019-04-17 · 📡 eess.AS · cs.LG· cs.SD

RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

classification 📡 eess.AS cs.LGcs.SD
keywords speakerdeepneuralwaveformsback-endclassificationembeddingend-to-end
0
0 comments X
read the original abstract

Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging conventional two-phase processes: extracting utterance-level features such as i-vectors or x-vectors and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts data augmentation.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.