Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

Geewook Kim; Minjoon Seo

arxiv: 2509.17901 · v4 · pith:5F7NWAAXnew · submitted 2025-09-22 · 💻 cs.CV · cs.MM · cs.SD

classification 💻 cs.CV · cs.MM · cs.SD

keywords audio · benchmarks · speech · video · audit · because · encoders · largely

Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines, not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers about 76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25-fold token reduction (25 Hz to 1 Hz). Across 10 benchmarks, with and without filtering, audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will open-source our work at https://github.com/naver-ai/unimambamia-av.
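
To make the 25-fold token reduction concrete, here is a minimal sketch of one possible compressor: plain mean pooling over non-overlapping windows of 25 tokens, which turns a 25 Hz token stream into a 1 Hz stream. This is an illustrative assumption, not necessarily one of the five compressor architectures compared in the paper; the function name `mean_pool_tokens` and the toy 4-dimensional features are hypothetical.

```python
from typing import List


def mean_pool_tokens(tokens: List[List[float]], factor: int = 25) -> List[List[float]]:
    """Average every `factor` consecutive token vectors into a single vector.

    Hypothetical helper: a stand-in for a learned compressor, used only to
    show the 25 Hz -> 1 Hz change in sequence length.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(tokens), factor):
        window = tokens[start:start + factor]  # last window may be shorter
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# 10 seconds of audio at 25 tokens per second, toy 4-dimensional features
tokens = [[float(i)] * 4 for i in range(250)]
compressed = mean_pool_tokens(tokens, factor=25)
print(len(tokens), "->", len(compressed))  # 250 -> 10
```

Under this sketch, a 10-second clip contributes 10 audio tokens to the language model instead of 250, which is the scale of reduction the abstract describes.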

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos
cs.CV · 2026-05 · unverdicted · novelty 6.0

UniMVU applies instruction-conditioned inner-modality and modality-level gates to adaptively fuse multiple video modalities, achieving gains of up to 13.5 CIDEr on six benchmarks including AVQA and MVBench.