An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Hui Wang; Jiajun Qi; Luyao Cheng; Qian Chen; Siqi Zheng; Yafeng Chen

arxiv: 2305.12838 · v2 · pith:EQWBHBHOnew · submitted 2023-05-22 · 📡 eess.AS · cs.SD

Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi

classification 📡 eess.AS cs.SD

keywords fusion · feature · features · global · local · eres2net · speaker · verification

Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve performance. The local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
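
The abstract contrasts plain summation with an attentional fusion of two feature streams. The sketch below illustrates that general idea only and is not the authors' module: the sigmoid gate, the function names (`attentional_fuse`, `summation_fuse`) and the `weights`/`bias` parameters are all assumptions made for illustration, and the real ERes2Net layers live in the linked repository. It uses plain Python lists so it runs without any dependencies.

```python
import math
import random


def sigmoid(v: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def summation_fuse(x, y):
    """Baseline fusion: element-wise sum of two feature vectors."""
    return [a + b for a, b in zip(x, y)]


def attentional_fuse(x, y, weights, bias):
    """Hypothetical gated fusion of two equal-length feature vectors.

    weights: one row per channel, each row holding len(x) + len(y) values
             (stands in for a learned projection; an assumption here).
    bias:    one value per channel.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concat = list(x) + list(y)
    fused = []
    for c, (row, b) in enumerate(zip(weights, bias)):
        # The gate depends on both inputs, so the mix is input-dependent.
        gate = sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        # Convex combination of the two streams for this channel.
        fused.append(gate * x[c] + (1.0 - gate) * y[c])
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Random values stand in for trained parameters.
    weights = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
    bias = [0.0] * dim
    print("summation  :", summation_fuse(x, y))
    print("attentional:", attentional_fuse(x, y, weights, bias))
```

With all weights and biases set to zero, every gate equals 0.5 and the output is the plain average of the two inputs. Learned weights would let each channel lean toward one stream or the other depending on the input, which is what a fixed sum or concatenation cannot do.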

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniVocal: Unified Speech-Singing Code-Switching Synthesis
cs.SD · 2026-06 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
cs.SD · 2025-05 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
cs.CL · 2025-02 · unverdicted · novelty 6.0

Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a n...

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
eess.AS · 2026-05 · unverdicted · novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition
cs.SD · 2025-04 · unverdicted · novelty 4.0

MT-BCA-CNN achieves 97% accuracy and 95% F1-score on 27-class few-shot underwater acoustic target recognition by combining channel attention and multi-task learning on the Watkins Marine Life Dataset.