
arxiv: 2002.05314 · v1 · pith:BUZ3RWHVnew · submitted 2020-02-13 · eess.AS · cs.LG · cs.MM · cs.SD · stat.ML

Self-supervised learning for audio-visual speaker diarization

keywords diarization, audio-video, loss, speaker, human-computer, interaction, learning, self-supervised

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.
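
For readers unfamiliar with the family of losses the abstract names, the following is a minimal sketch of the standard triplet margin loss applied to audio-video synchronization. It is not the paper's dynamic triplet loss or multinomial loss, neither of which is specified in the abstract. The function names, the margin value, and the toy two-dimensional embeddings are all illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not the paper's variant).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical toy embeddings: a video clip, its synchronized audio,
# and a time-shifted (out-of-sync) audio clip.
video = [0.0, 0.0]
audio_sync = [0.0, 1.0]
audio_shift = [3.0, 4.0]

# Synchronized audio is already much closer than shifted audio: no penalty.
loss_easy = triplet_loss(video, audio_sync, audio_shift)

# Swapping the roles violates the margin and is penalized.
loss_hard = triplet_loss(video, audio_shift, audio_sync)
```

In a self-supervised setup of this kind, the positive and negative pairs come from the video itself (in-sync versus shifted audio), so no speaker labels are required.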
