arxiv: 2305.12838 · v2 · pith:EQWBHBHOnew · submitted 2023-05-22 · 📡 eess.AS · cs.SD

An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

classification 📡 eess.AS cs.SD
keywords fusion, feature, features, global, local, eres2net, speaker, verification

Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve performance. The local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
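
To make the idea of replacing summation or concatenation with an attentional fusion step more concrete, here is a minimal sketch in plain Python. It is not the ERes2Net module itself: the function name `attentional_fusion`, the helper `matvec`, the two-layer gate, the ReLU and sigmoid choices, and the weight shapes are all assumptions made for illustration. It only shows the general pattern of computing per-channel weights from both inputs and using them to blend the two feature vectors.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def attentional_fusion(x, y, w1, w2):
    """Blend two feature vectors of equal length C with a learned gate.

    Hypothetical shapes: w1 is H x 2C, w2 is C x H. The gate design is an
    illustrative assumption, not the layer layout used in the paper.
    """
    stacked = list(x) + list(y)                        # concatenation of both inputs
    hidden = [max(h, 0.0) for h in matvec(w1, stacked)]  # ReLU bottleneck
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w2, hidden)]  # sigmoid in (0, 1)
    # Per-channel convex combination instead of a plain sum x + y.
    fused = [g * a + (1.0 - g) * b for g, a, b in zip(gate, x, y)]
    return fused, gate


# Toy inputs: 3 channels, hidden size 2 (values chosen arbitrarily).
x = [1.0, -2.0, 0.5]
y = [0.0, 4.0, 0.5]
w1 = [[0.1, -0.2, 0.3, 0.0, 0.2, -0.1],
      [-0.3, 0.1, 0.0, 0.2, -0.1, 0.4]]
w2 = [[0.5, -0.5], [0.2, 0.3], [-0.4, 0.1]]

fused, gate = attentional_fusion(x, y, w1, w2)
```

With all-zero weights the gate is 0.5 for every channel, so the output reduces to the plain average of the two inputs; non-zero weights let the blend vary per channel depending on the content of both inputs.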

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniVocal: Unified Speech-Singing Code-Switching Synthesis

    cs.SD 2026-06 unverdicted novelty 6.0

    UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

  2. CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    cs.SD 2025-05 unverdicted novelty 6.0

    CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

  3. Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    cs.CL 2025-02 unverdicted novelty 6.0

    Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a n...

  4. Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 5.0

    Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

  5. A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

    cs.SD 2025-04 unverdicted novelty 4.0

    MT-BCA-CNN achieves 97% accuracy and 95% F1-score on 27-class few-shot underwater acoustic target recognition by combining channel attention and multi-task learning on the Watkins Marine Life Dataset.