Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy
Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines, not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers about 76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25-fold token reduction (25 Hz to 1 Hz). Across 10 benchmarks, with and without filtering, audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will open-source our work at https://github.com/naver-ai/unimambamia-av.
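To make the 25-fold reduction concrete, the sketch below compresses a 25 Hz audio token stream to 1 Hz by averaging each one-second window. Mean pooling is only an illustrative stand-in here: the abstract does not name the five compressor architectures, and the function name `mean_pool_tokens` and the toy 4-dimensional features are assumptions for this example, not part of the paper's code.

```python
from typing import List


def mean_pool_tokens(tokens: List[List[float]], factor: int = 25) -> List[List[float]]:
    """Average each consecutive group of `factor` token vectors into one vector.

    Hypothetical helper for illustration; not one of the paper's compressors.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(tokens), factor):
        window = tokens[start:start + factor]  # last window may be shorter
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# 10 seconds of audio at 25 tokens per second, with toy 4-dimensional features.
audio_tokens = [[float(i)] * 4 for i in range(250)]
compressed = mean_pool_tokens(audio_tokens, factor=25)
print(len(audio_tokens), "->", len(compressed))  # prints "250 -> 10"
```

At 1 Hz, a 10-second clip contributes 10 audio tokens instead of 250, which is the token budget any compressor operating at this ratio would have to work within.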
Forward citations
Cited by 1 Pith paper
- Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos
UniMVU applies instruction-conditioned inner-modality and modality-level gates to adaptively fuse multiple video modalities, achieving gains of up to 13.5 CIDEr on six benchmarks including AVQA and MVBench.