
arxiv: 2512.09299 · v2 · submitted 2025-12-10 · 💻 cs.CV · cs.SD

VABench: A Comprehensive Benchmark for Audio-Video Generation

Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3

classification 💻 cs.CV cs.SD
keywords audio-video generation · benchmark · synchronization · lip-speech consistency · multimodal evaluation · text-to-audio-video · image-to-audio-video

The pith

VABench supplies a multi-dimensional benchmark to test synchronized audio-video generation models where prior visual-only tests fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VABench to evaluate models that produce video with matching audio. Existing benchmarks measure visual quality but provide no convincing checks for audio-video alignment. VABench defines three task types—text-to-audio-video, image-to-audio-video, and stereo generation—and scores them across fifteen dimensions that include text-video similarity, text-audio similarity, video-audio similarity, audio-video synchronization, lip-speech consistency, and audio-video question-answering pairs. These dimensions are applied to seven content categories such as animals, music, and complex scenes. The authors supply systematic analysis and visualizations to set a new standard for assessing joint audio-video output.

Core claim

VABench is a benchmark framework that systematically evaluates synchronous audio-video generation models through three primary task types, two major evaluation modules spanning fifteen dimensions that cover pairwise similarities, synchronization, lip-speech consistency, and curated QA pairs, plus seven content categories, establishing quantitative assessment where existing video benchmarks lack audio-video metrics.

What carries the argument

The VABench evaluation framework consisting of fifteen dimensions that measure pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs across seven content categories.
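
As a rough illustration of what a pairwise similarity dimension might compute, the sketch below scores two embedding vectors with cosine similarity. It assumes both modalities have already been mapped into a shared embedding space by some encoder. The function name `cosine_similarity` and the toy vectors are hypothetical, and this page does not state which similarity function the paper actually uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for hypothetical video and audio embeddings.
video_emb = [1.0, 0.0, 1.0]
audio_emb = [1.0, 0.0, 0.0]
score = cosine_similarity(video_emb, audio_emb)
```

The same function would apply unchanged to text-video, text-audio, or video-audio pairs, provided the encoder places all three modalities in one space.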

If this is right

  • Generation models can receive explicit scores for audio-video alignment rather than visual quality alone.
  • Developers gain concrete targets for improving synchronization and lip-speech consistency.
  • New models can be compared directly on joint audio-video tasks across text, image, and stereo inputs.
  • Research focus may shift toward metrics that treat audio and video as a single synchronized output.
  • Benchmark results can guide selection of models for applications requiring matched sound and motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of VABench could standardize training objectives so that models optimize for both modalities jointly instead of adding audio after video generation.
  • The framework may reveal whether current architectures handle environmental sounds or virtual-world scenes better than human-speech scenes.
  • Future extensions could add temporal metrics that track synchronization drift over longer sequences.
  • Results on VABench might correlate with downstream task performance such as video editing or virtual-reality rendering where audio-video mismatch is noticeable.

Load-bearing premise

That the chosen fifteen dimensions, pairwise similarity checks, lip-speech tests, and seven content categories together capture the essential qualities needed to judge audio-video generation performance.

What would settle it

A generation model that ranks high on all VABench dimensions yet produces clearly mismatched audio and video when tested on real-world clips outside the seven categories.

Figures

Figures reproduced from arXiv: 2512.09299 by Bohan Zeng, Daili Hua, Hao Liang, Junbo Niu, Quanqing Xu, Wentao Zhang, Xinlong Chen, Xinyi Huang, Xizhi Wang.

Figure 1: Overview of the VABench framework, illustrating its three main components: (1) The audio-video generation tasks being …
Figure 2: Data distribution of VABench. The sunburst chart illus…
Figure 3: VABench’s seven content categories, illustrated with ex…
Figure 4: Overview of the pipeline for benchmark data curation.
Figure 5: Qualitative comparison of model performance. We …
Figure 7: Comparative radar chart of three models: Phase Co…
Figure 9: Doppler-effect video for analysis. From top to bottom, …
Figure 11: Lightning video for analysis. From top to bot…
Figure 13: Stereo-sound video for analysis. From top to bottom, …
Figure 14: Stereophonic analysis of the video generated by Veo3 …
Figure 17: A sample from Veo3’s generated results, illustrating …
Figure 18: Analysis of the sample generated by Veo3, showing the …
Figure 19: Video generated by Sora2, showing dual-channel audio …
Figure 21: Demographic tendencies in generated human subjects …
Figure 23: Video example generated by Seedance+MMAudio on …
Figure 22: Video example generated by Veo3 on the T2AV task.
Figure 25: Video example generated by Kling+ThinkSound on the …
Original abstract

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
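
The structure described in the abstract can be written down as a small data model, sketched below. The task types and category names are taken from the abstract. The class and function names (`Task`, `Cell`, `evaluation_grid`) are hypothetical, the dimension list covers only the dimensions the abstract names rather than all 15, and pairing every task with every category is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List


class Task(Enum):
    T2AV = "text-to-audio-video"
    I2AV = "image-to-audio-video"
    STEREO = "stereo audio-video generation"


# The seven content categories listed in the abstract.
CATEGORIES = (
    "animals", "human sounds", "music", "environmental sounds",
    "synchronous physical sounds", "complex scenes", "virtual worlds",
)

# Only the dimensions the abstract names; the full set of 15 is not listed there.
NAMED_DIMENSIONS = (
    "text-video similarity", "text-audio similarity", "video-audio similarity",
    "audio-video synchronization", "lip-speech consistency",
    "audio QA", "video QA",
)


@dataclass(frozen=True)
class Cell:
    """One (task, category) slice on which scores could be reported."""
    task: Task
    category: str


def evaluation_grid() -> List[Cell]:
    # Assumption: every task is paired with every category.
    return [Cell(t, c) for t, c in product(Task, CATEGORIES)]


grid = evaluation_grid()
```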

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VABench, a benchmark framework for evaluating synchronous audio-video generation models. It defines three task types (text-to-audio-video, image-to-audio-video, and stereo audio-video generation), two evaluation modules spanning 15 dimensions (pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs), and seven content categories (animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, virtual worlds), with accompanying analysis and visualizations.

Significance. If the 15 dimensions and QA pairs are shown through validation to correlate with human judgments and to capture perceptually relevant failure modes missed by prior video benchmarks, VABench could provide a useful standardized evaluation protocol for audio-video synchronization in generative models. The multi-task and multi-category coverage is a constructive step toward addressing the gap noted in the abstract.
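
One conventional way to check whether a benchmark dimension tracks human judgment is a rank correlation between metric scores and human ratings. The sketch below computes Spearman's rho in plain Python, with average ranks for ties. The score and rating lists are invented toy data, and nothing here reflects a validation actually reported for VABench.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: one metric score and one mean human rating per clip.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_ratings = [3.1, 3.9, 2.8, 4.5, 3.0]
rho = spearman(metric_scores, human_ratings)
```

A value near 1 would suggest the metric orders clips much as human raters do. A low or negative value would support the referee's concern that a dimension may not measure what it is intended to.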

major comments (3)
  1. [Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.
  2. [QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.
  3. [Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.
minor comments (1)
  1. [Abstract] The abstract states that VABench 'establishes two major evaluation modules' but does not name or briefly characterize those two modules, which would improve immediate clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.

    Authors: We agree that additional implementation details are required for reproducibility. In the revised manuscript we will expand the Evaluation Modules section with explicit metric formulas (e.g., CLIP cosine similarity for text-video, SyncNet-based AV sync score, Wav2Lip lip-speech consistency), distance functions, and the precise QA scoring procedure, together with pseudocode and a pointer to the released evaluation code. revision: yes

  2. Referee: [QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.

    Authors: We acknowledge this gap. Although the QA pairs were curated by multiple annotators under a documented protocol, agreement statistics were not reported. We will add an inter-annotator agreement analysis (Fleiss’ kappa) and a human correlation study on a held-out subset in the revised version, placing detailed results in the appendix if space is constrained. revision: yes

  3. Referee: [Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.

    Authors: We agree that stronger justification is needed. The revision will include a comparative table against existing benchmarks (VBench, AIGC-Vid, etc.) and additional analysis with qualitative examples showing how the seven categories and 15 dimensions specifically surface failure modes such as temporal drift and semantic mismatch. We will also reference supporting human-perception literature where available. revision: yes
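
The inter-annotator agreement statistic named in response 2, Fleiss' kappa, follows a standard formula that can be sketched in a few lines. The function below takes a table of per-item category counts. The function name and the toy tables are hypothetical and say nothing about the agreement actually achieved by the VABench annotators.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)


# Toy tables: 4 items, 3 raters, 2 answer categories.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

Kappa is 1 when raters always agree and falls to or below 0 when agreement is no better than chance, which is why it is a common choice for checking curated QA labels.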

Circularity Check

0 steps flagged

No circularity: benchmark defined directly by tasks and dimensions

full rationale

The paper introduces VABench by enumerating three task types (T2AV, I2AV, stereo) and two evaluation modules spanning 15 dimensions (pairwise similarities, synchronization, lip-speech consistency, QA pairs) plus seven content categories. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The framework is specified by construction as a list of chosen axes rather than derived from prior results or self-citations that reduce to the same inputs. This is the normal case of a definitional benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on standard similarity and QA evaluation practices as domain assumptions without introducing new free parameters or invented entities.

axioms (1)
  • Domain assumption: standard pairwise similarity metrics and curated QA pairs are sufficient to evaluate audio-video synchronization and quality.
    Invoked when defining the 15 dimensions without additional justification or validation in the abstract.




Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

    cs.SD 2025-12 accept novelty 8.0

    PhyAVBench supplies the first benchmark and contrastive metric that measures whether text-to-audio-video models respect real-world audio physics across controlled prompt pairs.

  2. MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

    cs.CV 2026-05 conditional novelty 7.0

    MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...

  3. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  4. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  5. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD 2026-04 unverdicted novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  6. PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

    cs.SD 2025-12 unverdicted novelty 7.0

    PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.

  7. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  8. OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 7 Pith papers · 11 internal anchors

  1. [1]

    Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025. 3

  2. [2]

    Multi-step visual reasoning with visual tokens scaling and verification

    Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang. Multi-step visual reasoning with visual tokens scaling and verification.arXiv preprint arXiv:2506.07235, 2025. 3

  3. [3]

    Lumiere: A space-time diffu- sion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 2

  4. [4]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 1

  5. [5]

    VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks,

    Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A versatile video understanding and reasoning model from question answering to captioning tasks.arXiv preprint arXiv:2506.09079, 2025. 3

  6. [6]

    Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025. 2, 5, 1

  7. [7]

    Veo 3, 2025

    Google DeepMind. Veo 3, 2025. 1, 2

  8. [8]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvances in Neural Infor- mation Processing Systems, pages 8780–8794. Curran Asso- ciates, Inc., 2021. 2

  9. [9]

    Cogview2: faster and better text-to-image generation via hi- erarchical transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: faster and better text-to-image generation via hi- erarchical transformers. InProceedings of the 36th Inter- national Conference on Neural Information Processing Sys- tems, Red Hook, NY , USA, 2022. Curran Associates Inc. 2

  10. [10]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xi- aojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113,

  11. [11]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15180–15190, 2023. 5

  12. [12]

    Wan 2.5: Unified multi-modal video generation framework, 2025

    Alibaba Tongyi Group. Wan 2.5: Unified multi-modal video generation framework, 2025. 1, 2

  13. [13]

    Brace: A benchmark for robust audio caption quality evaluation

    Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 3

  14. [14]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text- to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023. 1

  15. [15]

    Latent video diffusion models for high-fidelity long video generation, 2023

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023. 2

  16. [16]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  17. [17]

    Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 2

  18. [18]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022. 2

  19. [19]

    Vbench: Comprehensive bench- mark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 1, 3

  20. [20]

    Synchformer: Efficient synchronization from sparse cues

    Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisser- man. Synchformer: Efficient synchronization from sparse cues. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024. 5

  21. [21]

    Kates.Signal Processing for Hearing Aids, pages 235–277

    James M. Kates.Signal Processing for Hearing Aids, pages 235–277. Springer US, Boston, MA, 2002. 5

  22. [22]

    Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024

    Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Wei- wei Xing. Latentsync: Taming audio-conditioned latent dif- fusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024. 5

  23. [23]

    Deepsound-v1: Start to think step-by-step in the audio gen- eration from videos.arXiv preprint arXiv:2503.22208, 2025

    Yunming Liang, Zihao Chen, Chaofan Ding, and Xinhan Di. Deepsound-v1: Start to think step-by-step in the audio gen- eration from videos.arXiv preprint arXiv:2503.22208, 2025. 2

  24. [24]

    Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

    Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 3 9

  25. [25]

    Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025

    Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025. 3

  26. [26]

    Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing,

    Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. Thinksound: Chain- of-thought reasoning in multimodal large language mod- els for audio generation and editing.arXiv preprint arXiv:2506.21448, 2025. 2

  27. [27]

    JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization,

    Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierar- chical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025. 1, 3

  28. [28]

    Video-p2p: Video editing with cross-attention control

    Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024. 1

  29. [29]

    Llm as dataset ana- lyst: Subpopulation structure discovery with large language model

    Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 3

  30. [30]

    Videofusion: Decomposed diffusion models for high- quality video generation

    Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jinren Zhou, and Tie- niu Tan. Decomposed diffusion models for high-quality video generation.arXiv preprint arXiv:2303.08320, 3, 2023. 2

  31. [31]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Zi- wei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024. 2

  32. [32]

    Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets.arXiv preprint arXiv:2104.09494,

    Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebas- tian M¨oller. Nisqa: A deep cnn-self-attention model for mul- tidimensional speech quality prediction with crowdsourced datasets.arXiv preprint arXiv:2104.09494, 2021. 4

  33. [33]

    Sora, 2024

    OpenAI. Sora, 2024. 2

  34. [34]

    Sora 2: Video generation model, 2025

    OpenAI. Sora 2: Video generation model, 2025. 1, 2

  35. [35]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

  36. [36]

    John Wiley & Sons, 2015

    Ville Pulkki and Matti Karjalainen.Communication acous- tics: an introduction to speech, audio and psychoacoustics. John Wiley & Sons, 2015. 5

  37. [37]

    Dns- mos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dns- mos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. InICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493–6497. IEEE, 2021. 4, 1

  38. [38]

    Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 3

  39. [39]

    Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation,

    Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025. 4

  40. [40]

    Mavors: Multi-granularity video representation for multimodal large language model

    Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanx- ing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, et al. Mavors: Multi-granularity video representation for multimodal large language model. InPro- ceedings of the 33rd ACM International Conference on Mul- timedia, pages 10994–11003, 2025. 3

  41. [41]

    Mme- videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios,

    Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, et al. Mme-videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios.arXiv preprint arXiv:2505.21333, 2025. 3

  42. [42]

    On the perception of the direction of sound.Proceedings of the Royal Society of London

    John William Strutt. On the perception of the direction of sound.Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Char- acter, 83(559):61–64, 1909. 5

  43. [43]

    Kling 2.5 turbo, 2025

    Kuaishou Technology. Kling 2.5 turbo, 2025. 6, 1

  44. [44]

    Peaq-the itu standard for objective measure- ment of perceived audio quality.Journal of the Audio Engi- neering Society, 48(1/2):3–29, 2000

    Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Cather- ine Colomes. Peaq-the itu standard for objective measure- ment of perceived audio quality.Journal of the Audio Engi- neering Society, 48(1/2):3–29, 2000. 5

  45. [45]

    Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

    Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139, 2025. 4

  46. [46]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019. 3

  47. [47]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 2, 6, 1

  48. [48]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 2

  49. [49]

    Kling-foley: Multimodal diffusion transformer for high-quality video-to-audio generation

    Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, et al. Kling-foley: Multimodal diffusion transformer for high-quality video-to-audio generation. arXiv preprint arXiv:2506.19774, 2025. 2, 3

  50. [50]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023. 5

  51. [51]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025. 2

  52. [52]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023. 1

  53. [53]

    Automated movie generation via multi-agent cot planning

    Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025. 1

  54. [54]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 5

  55. [55]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025. 3, 5, 1, 7

  56. [56]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2

  57. [57]

    Magvit: Masked generative video transformer

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023. 2

  58. [58]

    Evaluation agent: Efficient and promptable evaluation framework for visual generative models

    Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation agent: Efficient and promptable evaluation framework for visual generative models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7561–7582, 2025. 1

  59. [59]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025. 1, 3

  60. [60]

    Cogview3: Finer and faster text-to-image generation via relay diffusion

    Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, page 1–22, Berlin, Heidelberg, 2...

  61. [61]

    Supplementary results analysis in SpeechClarity and Artistry

    Additional evaluation metrics

    6.1. Supplementary results analysis in SpeechClarity and Artistry

    Table 3. Supplementary results for T2AV and I2AV

    | Models | SpeechClarity (T2AV) | Artistry (T2AV) | Artistry (I2AV) |
    | --- | --- | --- | --- |
    | sora2 | 2.367 | 3.735 | 3.931 |
    | veo3 | 2.554 | 3.825 | 3.983 |
    | wan2.5 | 2.396 | 3.838 | 3.929 |
    | seed think | 2.008 | 3.717 | 3.956 |
    | seed mm | 2.202 | 3.707 | 3.971 |
    | wan2.2 think | 1.882 | 3.630 | 3.942 |

    wan2...

  62. [62]

    The attributes of the videos generated by each model

    Audio-Video Generation Models in Evaluation

    Table 4. The attributes of the videos generated by each model.

    | Models | Length | FPS |
    | --- | --- | --- |
    | sora2 | 10s | 30 |
    | veo3 | 8s | 24 |
    | wan2.5 | 5s | 24 |
    | seedance 1.0 lite | 5s | 24 |
    | wan2.2 | 5s | 24 |
    | kling2.5 turbo | 5s | 24 |

    In our experiments, we adhered to the default configuration parameters provided by each video generation model, as summarized in Tab....

  63. [63]

    contemplation

    Detail Analysis of Different Tasks

    This section provides a comprehensive analysis of experimental results across different categories for various models under both T2AV (Tab. 5) and I2AV (Tab. 6) tasks. The study aims to identify common patterns across tasks and elucidate the specific impact of image-conditioned input (I2AV) on the final outcomes. ...

  64. [64]

    approach-pass-recede

    Qualitative Analysis

    In this section, we conduct a more detailed analysis based on several specific scenarios. These scenarios are selected to examine how the models handle challenging multimodal cues involving physical principles, temporal constraints, and spatial structures.

    9.1. Doppler Effect

    This part evaluates whether the models can generate acous- ...

  65. [65]

    alternate left-right channels

    Special samples Analysis

    10.1. Veo3 Case Analysis

    We examine a case (Fig. 17) where Veo3 autonomously generated stereophonic audio featuring distinct Doppler effects, notably without explicit spatial specifications in the input prompt. We conducted time-domain waveform and spectrogram analyses for both channels, as shown in Fig. 18. The specific prompt u...

  66. [66]

    “score” (integer 1-5) and “reason”

    MLLM Based Evaluation Cases

    11.1. Macro Evaluation System Prompt Sample

    As introduced in the main paper, our evaluation framework leverages Qwen2.5 Omni 7B [55] to provide a scalable and standardized alternative to traditional MOS. This supplementary section provides the specific implementation details for the coarse-grained (macro) evaluation level. ...

  67. [67]

    “score” (integer 1-5) and “reason”

    Narrative function: Does audio actively clarify or enhance the story? Examples include:

    - Highlighting a key action (e.g., a heartbeat during a reveal)
    - Conveying character perspective (e.g., muffled sound during dazed POV)
    - Bridging scenes through sound continuity (e.g., train whistle fading into next location)
    - Providing off-screen context (e....