Analysis of ABC Frontend Audio Systems for the NIST-SRE24

Anna Silnova; Dominik Boboš; Federico Landini; Jiangyu Han; Johan Rohdin; Junyi Peng; Ladislav Mošner; Lin Zhang; Lukáš Burget; Martin Kodovsky

arXiv: 2505.15320 · v1 · pith:DIIEUX47new · submitted 2025-05-21 · eess.AS · cs.SD

Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky, Tomáš Pavlíček

keywords: analysis, audio, condition, embedding, extractors, frontends, models, NIST

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: training only on a provided set of telephone recordings (fixed condition) or additionally using publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explore architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observe good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.
