AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization

2026 · cs.CV · arXiv 2607.00726

abstract

Audio-visual feature extraction is a fundamental component of multimodal understanding and generation tasks. However, existing evaluation protocols for feature extraction models exhibit dimensional bias, typically focusing on either semantic matching or temporal offset detection. Moreover, their data construction remains coupled, preventing independent assessment of temporal and semantic consistency. We propose AV-SyncBench, the first benchmark to fully separate temporal and semantic evaluation for audio-visual synchronization. Built from in-the-wild videos, it spans Voice, Music, and Sound across 10 scenarios and 5 challenge tasks. Data are automatically filtered and manually verified to ensure on-screen sound sources. The benchmark contains 3,269 videos and 38,390 samples, and we evaluate five representative models to quantify feature quality for alignment and downstream tasks. The code and dataset are available at: https://fgt7t6g.github.io/AV-SyncBench.

representative citing papers

AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

AV-SyncBench is a new benchmark dataset of 3,269 videos that separates temporal and semantic audio-visual synchronization assessment across voice, music, and sound scenarios.
