Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

Khoa Quang Nhat Cao; Le Thien Phuc Nguyen; Lucas Poon; Soochahn Lee; Toan Ngo Duc Vo; Tuan Khai Nguyen; Tuan Tai Nguyen; Tu Ho Manh Pham; Yong Jae Lee; Yuwei Guo

arxiv: 2505.21954 · v2 · pith:XT3UVZMBnew · submitted 2025-05-28 · 💻 cs.CV · cs.AI


Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Tuan Khai Nguyen, Soochahn Lee, Yong Jae Lee



keywords: unitalk, models, active, benchmark, challenging, conditions, detection, generalization



We present UniTalk, a novel dataset emphasizing challenging scenarios to enhance model generalization for the task of active speaker detection (ASD). Previously established benchmarks such as AVA predominantly comprise old movies and thus exhibit significant domain gaps with real-world video. In contrast, UniTalk covers diverse video types reflecting challenging real-world conditions, including underrepresented languages, noisy backgrounds, and crowded scenes, while being on par with AVA in scale. Extensive evaluations reveal that ASD remains unsolved under realistic conditions: state-of-the-art models that are near-perfect on AVA fail to reach saturation on UniTalk. Conversely, models trained on UniTalk generalize better to modern in-the-wild datasets, including Talkies and ASW. UniTalk thus establishes a new benchmark for ASD, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV · 2025-12 · unverdicted · novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.