pith. sign in

arxiv: 2505.21954 · v2 · pith:XT3UVZMBnew · submitted 2025-05-28 · 💻 cs.CV · cs.AI

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

classification 💻 cs.CV cs.AI
keywords unitalkmodelsactivebenchmarkchallengingconditionsdetectiongeneralization
0
0 comments X
read the original abstract

We present UniTalk, a novel dataset emphasizing challenging scenarios to enhance model generalization for the task of active speaker detection (ASD). Previously established benchmarks such as AVA predominantly comprise old movies and thus exhibit significant domain gaps with real-world video. In contrast, UniTalk covers diverse video types reflecting challenging real-world conditions, including underrepresented languages, noisy backgrounds, and crowded scenes, while being on par with AVA in scale. Extensive evaluations reveal that ASD remains unsolved under realistic conditions: state-of-the-art models near-perfect on AVA fail to reach saturation on UniTalk. Conversely, models trained on UniTalk generalize better to modern in-the-wild datasets including Talkies and ASW. UniTalk thus establishes a new benchmark for ASD, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 7.0

    AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.