Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

2025 · cs.CV · arXiv 2509.17901

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines, not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers about 76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25-fold token reduction (25 Hz to 1 Hz). Across 10 benchmarks, with and without filtering, audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will open-source our work at https://github.com/naver-ai/unimambamia-av.
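To make the 25-fold token reduction (25 Hz to 1 Hz) concrete, the following is a minimal sketch of the simplest possible compressor: averaging non-overlapping windows of 25 audio tokens into one. It is not taken from the paper, and the abstract does not say which five compressor architectures were compared, so this should be read as an illustrative baseline only. The function name `mean_pool_tokens` and the toy feature dimension are assumptions made for this example.

```python
from typing import List


def mean_pool_tokens(tokens: List[List[float]], factor: int = 25) -> List[List[float]]:
    """Average non-overlapping windows of `factor` token vectors into one vector each.

    A trailing partial window is averaged over the tokens it actually contains.
    Hypothetical helper for illustration; not the paper's compressor.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(tokens), factor):
        window = tokens[start:start + factor]
        dim = len(window[0])
        # Per-dimension mean across the tokens in this window.
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Toy input: 2 seconds of audio at 25 tokens per second, 4-dim features (assumed sizes).
tokens = [[float(t)] * 4 for t in range(50)]
compressed = mean_pool_tokens(tokens, factor=25)
print(len(tokens), "->", len(compressed))  # 50 -> 2
```

With `factor=25`, each second of 25 Hz encoder output collapses to a single token, which matches the 1 Hz rate quoted in the abstract. Learned compressors would replace the plain mean with trainable parameters.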

representative citing papers

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

UniMVU applies instruction-conditioned inner-modality and modality-level gates to adaptively fuse multiple video modalities, achieving gains of up to 13.5 CIDEr on six benchmarks including AVQA and MVBench.

