AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Andrew Gallagher; Arkadiusz Stopczynski; Caroline Pantofaru; Cordelia Schmid; Joseph Roth; Liat Kaver; Ondrej Klejch; Radhika Marvin; Sharadh Ramaswamy; Sourish Chaudhuri

arxiv: 1901.01342 · v2 · pith:7LRUPRTKnew · submitted 2019-01-05 · 💻 cs.CV · cs.MM · cs.SD · eess.AS

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

classification 💻 cs.CV cs.MM cs.SD eess.AS

keywords dataset, speaker, active, detection, labeled, audio-visual, face, video

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
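
The labeling scheme and the size figures in the abstract can be illustrated with a minimal Python sketch. It is not the dataset's actual schema or tooling: the names `SpeechLabel`, `FaceInstance`, `implied_frame_rate`, and `speaking_fraction` are hypothetical, and the three label values are an assumption about how "speaking or not, and whether the speech is audible" might be encoded. The only numbers taken from the abstract are 3.65 million frames and 38.5 hours.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class SpeechLabel(Enum):
    # Assumed encoding of "speaking or not" plus "audible or not";
    # the released dataset's own label names may differ.
    NOT_SPEAKING = 0
    SPEAKING_AUDIBLE = 1
    SPEAKING_NOT_AUDIBLE = 2


@dataclass(frozen=True)
class FaceInstance:
    """One labeled face in one frame of a face track (hypothetical layout)."""
    track_id: str
    timestamp_s: float
    label: SpeechLabel


def implied_frame_rate(num_frames: float, hours: float) -> float:
    """Average labeled frames per second of face track."""
    return num_frames / (hours * 3600.0)


def speaking_fraction(track: Sequence[FaceInstance]) -> float:
    """Share of instances in a track labeled as speaking (audible or not)."""
    if not track:
        return 0.0
    speaking = sum(1 for f in track if f.label is not SpeechLabel.NOT_SPEAKING)
    return speaking / len(track)


# Figures quoted in the abstract: about 3.65 million frames, about 38.5 hours.
fps = implied_frame_rate(3.65e6, 38.5)

# A toy track of four instances, invented purely for illustration.
toy_track = [
    FaceInstance("track_0", 0.00, SpeechLabel.NOT_SPEAKING),
    FaceInstance("track_0", 0.04, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance("track_0", 0.08, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance("track_0", 0.12, SpeechLabel.SPEAKING_NOT_AUDIBLE),
]
toy_fraction = speaking_fraction(toy_track)
```

Under these assumptions the two quoted figures work out to roughly 26 labeled frames per second of face track, which suggests the labels are dense at about the video frame rate rather than sparsely sampled.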

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)
cs.SD · 2019-06 · unverdicted · novelty 2.0

A 3D CNN front-end plus temporal convolution and LSTM ensemble yields significant gains over baseline for active speaker detection on the AVA dataset.