pith. machine review for the scientific record.

arxiv: 2512.09299 · v2 · submitted 2025-12-10 · 💻 cs.CV · cs.SD


VABench: A Comprehensive Benchmark for Audio-Video Generation

classification 💻 cs.CV cs.SD
keywords generation, audio-video, comprehensive, vabench, video, audio, models, sounds

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
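As a rough illustration of the pairwise similarity dimensions named in the abstract (text-video, text-audio, video-audio), the sketch below scores every pair of modalities with cosine similarity. It assumes each modality has already been mapped into a shared embedding space; the abstract does not say which encoders or similarity measure VABench uses, so the function names, the toy vectors, and the choice of cosine similarity are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Score every unordered pair of modalities in `embeddings`.

    `embeddings` maps a modality name to its vector. Keys of the result
    follow the dict's insertion order, e.g. "text-video".
    """
    return {
        f"{a}-{b}": cosine(embeddings[a], embeddings[b])
        for a, b in combinations(embeddings, 2)
    }


# Toy vectors standing in for encoder outputs (assumed, not from the paper).
toy = {
    "text": [1.0, 0.0, 0.0],
    "video": [1.0, 1.0, 0.0],
    "audio": [0.0, 1.0, 0.0],
}
scores = pairwise_similarities(toy)
for pair, value in scores.items():
    print(f"{pair}: {value:.3f}")
```

In this toy case the video vector sits between the text and audio vectors, so the text-video and video-audio scores are equal while text-audio is zero. A real evaluation would replace the toy vectors with embeddings from whatever cross-modal encoders the benchmark specifies.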


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  2. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  3. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD 2026-04 unverdicted novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  4. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  5. OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...