Self-supervised learning for audio-visual speaker diarization

Liqiang Wang; Shi-Xiong Zhang; Yahuan Cong; Yifan Ding; Yong Xu

arxiv: 2002.05314 · v1 · pith:BUZ3RWHVnew · submitted 2020-02-13 · eess.AS · cs.LG · cs.MM · cs.SD · stat.ML

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

keywords: diarization, audio-video, loss, speaker, human-computer, interaction, learning, self-supervised

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.
