pith. machine review for the scientific record.

arxiv: 2512.12772 · v2 · submitted 2025-12-14 · 💻 cs.MM · cs.CV

Recognition: unknown

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Authors on Pith: no claims yet
classification: 💻 cs.MM · cs.CV
keywords: audio, information, benchmark, reasoning, audio-visual, dimensions, evaluate, evaluation
original abstract

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
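The abstract's multi-modal-dependency criterion (questions unanswerable from vision or audio alone) suggests a natural filtering step for the synthesized QA items. The paper's actual pipeline is not detailed on this page; the sketch below is a minimal illustration under assumed interfaces, where `requires_joint_reasoning`, the `item` schema, and the callable uni-modal models are all hypothetical names, not the authors' API:

```python
# Hedged sketch of a multi-modal-dependency filter, NOT the authors' pipeline:
# keep a synthesized QA item only if neither uni-modal model answers it correctly.
def requires_joint_reasoning(item, vision_llm, audio_llm):
    """item: dict with 'question' and 'answer' strings.
    vision_llm / audio_llm: callables taking a question and returning a
    predicted answer string (hypothetical uni-modal model interfaces)."""
    gold = item["answer"].strip().lower()
    v_pred = vision_llm(item["question"]).strip().lower()
    a_pred = audio_llm(item["question"]).strip().lower()
    # Discard items that either modality can solve on its own.
    return v_pred != gold and a_pred != gold

# Toy usage with stub uni-modal models.
vision = lambda q: "a dog"   # stub vision-only answer
audio = lambda q: "a bark"   # stub audio-only answer
item = {"question": "What is making the sound?", "answer": "the dog on screen"}
print(requires_joint_reasoning(item, vision, audio))  # True: neither stub matches
```

In practice a real filter would need semantic answer matching (e.g., LLM-judged equivalence) rather than exact string comparison, since free-form model answers rarely match the gold answer verbatim.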

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

    cs.MM · 2026-05 · unverdicted · novelty 7.0

    FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

  3. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

  4. MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    cs.MM · 2026-05 · unverdicted · novelty 6.0

    MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

  5. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.