VABench: A Comprehensive Benchmark for Audio-Video Generation

Bohan Zeng; Daili Hua; Hao Liang; Junbo Niu; Quanqing Xu; Wentao Zhang; Xinlong Chen; Xinyi Huang; Xizhi Wang

arxiv: 2512.09299 · v2 · submitted 2025-12-10 · 💻 cs.CV · cs.SD


Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3


keywords audio-video generation · benchmark · synchronization · lip-speech consistency · multimodal evaluation · text-to-audio-video · image-to-audio-video


The pith

VABench supplies a multi-dimensional benchmark to test synchronized audio-video generation models where prior visual-only tests fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VABench to evaluate models that produce video with matching audio. Existing benchmarks measure visual quality but provide no convincing checks for audio-video alignment. VABench defines three task types—text-to-audio-video, image-to-audio-video, and stereo generation—and scores them across fifteen dimensions that include text-video similarity, text-audio similarity, video-audio similarity, audio-video synchronization, lip-speech consistency, and audio-video question-answering pairs. These dimensions are applied to seven content categories such as animals, music, and complex scenes. The authors supply systematic analysis and visualizations to set a new standard for assessing joint audio-video output.

Core claim

VABench is a benchmark framework that systematically evaluates synchronous audio-video generation models through three primary task types, two major evaluation modules spanning fifteen dimensions that cover pairwise similarities, synchronization, lip-speech consistency, and curated QA pairs, plus seven content categories, establishing quantitative assessment where existing video benchmarks lack audio-video metrics.

What carries the argument

The VABench evaluation framework consisting of fifteen dimensions that measure pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs across seven content categories.

If this is right

Generation models can receive explicit scores for audio-video alignment rather than visual quality alone.
Developers gain concrete targets for improving synchronization and lip-speech consistency.
New models can be compared directly on joint audio-video tasks across text, image, and stereo inputs.
Research focus may shift toward metrics that treat audio and video as a single synchronized output.
Benchmark results can guide selection of models for applications requiring matched sound and motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of VABench could standardize training objectives so that models optimize for both modalities jointly instead of adding audio after video generation.
The framework may reveal whether current architectures handle environmental sounds or virtual-world scenes better than human-speech scenes.
Future extensions could add temporal metrics that track synchronization drift over longer sequences.
Results on VABench might correlate with downstream task performance such as video editing or virtual-reality rendering where audio-video mismatch is noticeable.

Load-bearing premise

That the chosen fifteen dimensions, pairwise similarity checks, lip-speech tests, and seven content categories together capture the essential qualities needed to judge audio-video generation performance.

What would settle it

A generation model that ranks high on all VABench dimensions yet produces clearly mismatched audio and video when tested on real-world clips outside the seven categories.

Figures

Figures reproduced from arXiv: 2512.09299 by Bohan Zeng, Daili Hua, Hao Liang, Junbo Niu, Quanqing Xu, Wentao Zhang, Xinlong Chen, Xinyi Huang, Xizhi Wang.

**Figure 1.** Overview of the VABench framework, illustrating its three main components: (1) The audio-video generation tasks being

**Figure 2.** Data distribution of VABench. The sunburst chart illus

**Figure 3.** VABench’s seven content categories, illustrated with ex

**Figure 4.** Overview of the pipeline for benchmark data curation.

**Figure 5.** Qualitative comparison of model performance. We

**Figure 7.** Comparative radar chart of three models: Phase Co

**Figure 9.** Doppler-effect video for analysis. From top to bottom,

**Figure 11.** Lightning video for analysis. From top to bot

**Figure 13.** Stereo-sound video for analysis. From top to bottom,

**Figure 14.** Stereophonic analysis of the video generated by Veo3

**Figure 17.** A sample from Veo3’s generated results, illustrating

**Figure 18.** Analysis of the sample generated by Veo3, showing the

**Figure 19.** Video generated by Sora2, showing dual-channel audio

**Figure 21.** Demographic tendencies in generated human subjects

**Figure 23.** Video example generated by Seedance+MMAudio on

**Figure 22.** Video example generated by Veo3 on the T2AV task.

**Figure 25.** Video example generated by Kling+ThinkSound on the

Original abstract

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
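The taxonomy in the abstract can be written down as a small data structure. The sketch below is illustrative only: the enum and function names are hypothetical, the dimension labels in the usage example are placeholders (the abstract does not list all 15 dimensions), and averaging per category is an assumed aggregation rather than the paper's confirmed scoring rule.

```python
from collections import defaultdict
from enum import Enum
from statistics import mean
from typing import Iterable


class Task(Enum):
    # The three task types named in the abstract.
    T2AV = "text-to-audio-video"
    I2AV = "image-to-audio-video"
    STEREO = "stereo audio-video generation"


class Category(Enum):
    # The seven content categories named in the abstract.
    ANIMALS = "animals"
    HUMAN_SOUNDS = "human sounds"
    MUSIC = "music"
    ENVIRONMENTAL_SOUNDS = "environmental sounds"
    SYNCHRONOUS_PHYSICAL_SOUNDS = "synchronous physical sounds"
    COMPLEX_SCENES = "complex scenes"
    VIRTUAL_WORLDS = "virtual worlds"


def mean_score_by_category(
    records: Iterable[tuple[Category, str, float]],
) -> dict[Category, float]:
    """Average scores per category.

    Each record is (category, dimension_name, score). The dimension name is
    carried along for bookkeeping but not used in this simple aggregate.
    """
    buckets: dict[Category, list[float]] = defaultdict(list)
    for category, _dimension, score in records:
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


# Hypothetical scores, for illustration only.
example = [
    (Category.MUSIC, "av_sync", 0.5),
    (Category.MUSIC, "lip_speech", 1.0),
    (Category.ANIMALS, "av_sync", 0.25),
]
per_category = mean_score_by_category(example)
```

A per-category breakdown like this is one way to see whether a model is weaker on, say, human sounds than on environmental sounds.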

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VABench, a benchmark framework for evaluating synchronous audio-video generation models. It defines three task types (text-to-audio-video, image-to-audio-video, and stereo audio-video generation), two evaluation modules spanning 15 dimensions (pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs), and seven content categories (animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, virtual worlds), with accompanying analysis and visualizations.

Significance. If the 15 dimensions and QA pairs are shown through validation to correlate with human judgments and to capture perceptually relevant failure modes missed by prior video benchmarks, VABench could provide a useful standardized evaluation protocol for audio-video synchronization in generative models. The multi-task and multi-category coverage is a constructive step toward addressing the gap noted in the abstract.
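The condition above, that benchmark dimensions should correlate with human judgments, is usually checked with a rank correlation. The following is a minimal sketch of Spearman's rank correlation in plain Python. The function names and the example scores are hypothetical, and nothing here is taken from the paper's evaluation code.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-video metric scores and human ratings.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_ratings = [1, 3, 5, 2]
rho = spearman(metric_scores, human_ratings)
```

A value near 1 would indicate that the automatic dimension orders videos the same way human raters do. A value near 0 would suggest the dimension does not track perceived quality.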

major comments (3)

[Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.
[QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.
[Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.

minor comments (1)

[Abstract] The abstract states that VABench 'establishes two major evaluation modules' but does not name or briefly characterize those two modules, which would improve immediate clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.


Referee: [Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.

Authors: We agree that additional implementation details are required for reproducibility. In the revised manuscript we will expand the Evaluation Modules section with explicit metric formulas (e.g., CLIP cosine similarity for text-video, SyncNet-based AV sync score, Wav2Lip lip-speech consistency), distance functions, and the precise QA scoring procedure, together with pseudocode and a pointer to the released evaluation code. revision: yes
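As a rough illustration of the kind of formula the response refers to, the sketch below computes a cosine-similarity score between one text embedding and a set of per-frame embeddings. It assumes the embeddings already come from a shared text-image encoder such as CLIP. The function names, the toy vectors, and the choice to average over frames are assumptions for illustration, not the benchmark's confirmed procedure.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def text_video_similarity(
    text_embedding: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the text and each frame (assumed pooling)."""
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    scores = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(scores) / len(scores)


# Toy 2-D vectors standing in for real encoder outputs.
score = text_video_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The same arithmetic would apply to text-audio or video-audio pairs, provided both modalities are embedded in a common space.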
Referee: [QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.

Authors: We acknowledge this gap. Although the QA pairs were curated by multiple annotators under a documented protocol, agreement statistics were not reported. We will add an inter-annotator agreement analysis (Fleiss’ kappa) and a human correlation study on a held-out subset in the revised version, placing detailed results in the appendix if space is constrained. revision: yes
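For reference, Fleiss' kappa can be computed from a table of rating counts. This is a minimal sketch under stated assumptions: every item is rated by the same number of annotators, and `counts[i][j]` (a hypothetical input layout) holds how many annotators placed item `i` in category `j`. The example tables are made up.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts."""
    if not counts:
        raise ValueError("at least one item is required")
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    for row in counts:
        if len(row) != n_categories or sum(row) != n_raters:
            raise ValueError("every item needs the same raters and categories")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_chance = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up tables: 2 items, 3 annotators, 2 answer categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

Values close to 1 indicate agreement well beyond chance. Values at or below 0 indicate agreement no better than chance, which would call the QA labels into question.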
Referee: [Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.

Authors: We agree that stronger justification is needed. The revision will include a comparative table against existing benchmarks (VBench, AIGC-Vid, etc.) and additional analysis with qualitative examples showing how the seven categories and 15 dimensions specifically surface failure modes such as temporal drift and semantic mismatch. We will also reference supporting human-perception literature where available. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark defined directly by tasks and dimensions


The paper introduces VABench by enumerating three task types (T2AV, I2AV, stereo) and two evaluation modules spanning 15 dimensions (pairwise similarities, synchronization, lip-speech consistency, QA pairs) plus seven content categories. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The framework is specified by construction as a list of chosen axes rather than derived from prior results or self-citations that reduce to the same inputs. This is the normal case of a definitional benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on standard similarity and QA evaluation practices as domain assumptions without introducing new free parameters or invented entities.

axioms (1)

domain assumption: Standard pairwise similarity metrics and curated QA pairs are sufficient to evaluate audio-video synchronization and quality.
Invoked when defining the 15 dimensions without additional justification or validation in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

VABench encompasses two primary audio-video generation tasks... 15 fine-grained metrics... pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We propose VABench... suite of 15 fine-grained metrics... seven major content categories: animals, human sounds, music...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
cs.SD · 2025-12 · accept · novelty 8.0

PhyAVBench supplies the first benchmark and contrastive metric that measures whether text-to-audio-video models respect real-world audio physics across controlled prompt pairs.

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
cs.CV · 2026-05 · conditional · novelty 7.0

MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...

Do Joint Audio-Video Generation Models Understand Physics?
cs.SD · 2026-05 · unverdicted · novelty 7.0

Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
cs.SD · 2026-05 · unverdicted · novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD · 2026-04 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
cs.SD · 2025-12 · unverdicted · novelty 7.0

PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
cs.CV · 2026-05 · unverdicted · novelty 6.0

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
cs.CV · 2026-04 · unverdicted · novelty 6.0

OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 7 Pith papers · 11 internal anchors

[1]

Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025. 3

work page arXiv 2025
[2]

Multi-step visual reasoning with visual tokens scaling and verification

Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang. Multi-step visual reasoning with visual tokens scaling and verification.arXiv preprint arXiv:2506.07235, 2025. 3

work page arXiv 2025
[3]

Lumiere: A space-time diffu- sion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 2

work page 2024
[4]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 1

work page 2023
[5]

VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks,

Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A versatile video understanding and reasoning model from question answering to captioning tasks.arXiv preprint arXiv:2506.09079, 2025. 3

work page arXiv 2025
[6]

Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to- audio synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025. 2, 5, 1

work page 2025
[7]

Veo 3, 2025

Google DeepMind. Veo 3, 2025. 1, 2

work page 2025
[8]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvances in Neural Infor- mation Processing Systems, pages 8780–8794. Curran Asso- ciates, Inc., 2021. 2

work page 2021
[9]

Cogview2: faster and better text-to-image generation via hi- erarchical transformers

Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: faster and better text-to-image generation via hi- erarchical transformers. InProceedings of the 36th Inter- national Conference on Neural Information Processing Sys- tems, Red Hook, NY , USA, 2022. Curran Associates Inc. 2

work page 2022
[10]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xi- aojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15180–15190, 2023. 5

work page 2023
[12]

Wan 2.5: Unified multi-modal video generation framework, 2025

Alibaba Tongyi Group. Wan 2.5: Unified multi-modal video generation framework, 2025. 1, 2

work page 2025
[13]

Brace: A benchmark for robust audio caption quality evaluation

Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 3

work page 2025
[14]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text- to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Latent video diffusion models for high-fidelity long video generation, 2023

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023. 2

work page 2023
[16]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[17]

Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 2

work page 2022
[18]

Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022. 2

work page 2022
[19]

Vbench: Comprehensive bench- mark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 1, 3

work page 2024
[20]

Synchformer: Efficient synchronization from sparse cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisser- man. Synchformer: Efficient synchronization from sparse cues. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024. 5

work page 2024
[21]

Kates.Signal Processing for Hearing Aids, pages 235–277

James M. Kates.Signal Processing for Hearing Aids, pages 235–277. Springer US, Boston, MA, 2002. 5

work page 2002
[22]

Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024

Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Wei- wei Xing. Latentsync: Taming audio-conditioned latent dif- fusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024. 5

work page arXiv 2024
[23]

Deepsound-v1: Start to think step-by-step in the audio gen- eration from videos.arXiv preprint arXiv:2503.22208, 2025

Yunming Liang, Zihao Chen, Chaofan Ding, and Xinhan Di. Deepsound-v1: Start to think step-by-step in the audio gen- eration from videos.arXiv preprint arXiv:2503.22208, 2025. 2

work page arXiv 2025
[24]

Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 3 9

work page arXiv 2024
[25]

Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025. 3

work page 2025
[26]

Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing,

Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. Thinksound: Chain- of-thought reasoning in multimodal large language mod- els for audio generation and editing.arXiv preprint arXiv:2506.21448, 2025. 2

work page arXiv 2025
[27]

JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization,

Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierar- chical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025. 1, 3

work page arXiv 2025
[28]

Video-p2p: Video editing with cross-attention control

Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024. 1

work page 2024
[29]

Llm as dataset ana- lyst: Subpopulation structure discovery with large language model

Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 3

work page 2024
[30]

Videofusion: Decomposed diffusion models for high- quality video generation

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jinren Zhou, and Tie- niu Tan. Decomposed diffusion models for high-quality video generation.arXiv preprint arXiv:2303.08320, 3, 2023. 2

work page arXiv 2023
[31]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Zi- wei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets.arXiv preprint arXiv:2104.09494,

Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebas- tian M¨oller. Nisqa: A deep cnn-self-attention model for mul- tidimensional speech quality prediction with crowdsourced datasets.arXiv preprint arXiv:2104.09494, 2021. 4

work page arXiv 2021
[33]

Sora, 2024

OpenAI. Sora, 2024. 2

work page 2024
[34]

Sora 2: Video generation model, 2025

OpenAI. Sora 2: Video generation model, 2025. 1, 2

work page 2025
[35]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

John Wiley & Sons, 2015

Ville Pulkki and Matti Karjalainen.Communication acous- tics: an introduction to speech, audio and psychoacoustics. John Wiley & Sons, 2015. 5

work page 2015
[37]

Dns- mos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dns- mos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. InICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493–6497. IEEE, 2021. 4, 1

work page 2021
[38]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 3

work page 2016
[39]

Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation,

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025. 4

work page arXiv 2025
[40]

Mavors: Multi-granularity video representation for multimodal large language model

Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanx- ing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, et al. Mavors: Multi-granularity video representation for multimodal large language model. InPro- ceedings of the 33rd ACM International Conference on Mul- timedia, pages 10994–11003, 2025. 3

work page 2025
[41]

Mme- videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios,

Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, et al. Mme-videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios.arXiv preprint arXiv:2505.21333, 2025. 3

[42] John William Strutt. On the perception of the direction of sound. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 83(559):61–64, 1909.

[43] Kuaishou Technology. Kling 2.5 turbo, 2025.

[44] Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. Peaq-the itu standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000.

[45] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139, 2025.

[46] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.

[47] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[48] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.

[49] Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, et al. Kling-foley: Multimodal diffusion transformer for high-quality video-to-audio generation. arXiv preprint arXiv:2506.19774, 2025.

[50] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.

[51] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025.

[52] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023.

[53] Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025.

[54] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[55] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[56] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[57] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.

[58] Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation agent: Efficient and promptable evaluation framework for visual generative models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7561–7582, 2025.

[59] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

[60] Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, page 1–22, Berlin, Heidelberg, 2...

[61] Additional evaluation metrics. 6.1. Supplementary results analysis in SpeechClarity and Artistry.

Table 3. Supplementary results for T2AV and I2AV.

Models         T2AV SpeechClarity   T2AV Artistry   I2AV Artistry
sora2          2.367                3.735           3.931
veo3           2.554                3.825           3.983
wan2.5         2.396                3.838           3.929
seed think     2.008                3.717           3.956
seed mm        2.202                3.707           3.971
wan2.2 think   1.882                3.630           3.942
wan2...

[62] Audio-Video Generation Models in Evaluation. In our experiments, we adhered to the default configuration parameters provided by each video generation model, as summarized in Tab....

Table 4. The attributes of the videos generated by each model.

Models              Length   FPS
sora2               10s      30
veo3                8s       24
wan2.5              5s       24
seedance 1.0 lite   5s       24
wan2.2              5s       24
kling2.5 turbo      5s       24

[63] contemplation

Detailed Analysis of Different Tasks. This section provides a comprehensive analysis of experimental results across different categories for various models under both T2AV (Tab. 5) and I2AV (Tab. 6) tasks. The study aims to identify common patterns across tasks and elucidate the specific impact of image-conditioned input (I2AV) on the final outcomes. ...

[64] approach-pass-recede

Qualitative Analysis. In this section, we conduct a more detailed analysis based on several specific scenarios. These scenarios are selected to examine how the models handle challenging multimodal cues involving physical principles, temporal constraints, and spatial structures. 9.1. Doppler Effect. This part evaluates whether the models can generate acous- ...
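
As a point of reference for the approach-pass-recede scenario, the textbook Doppler relation for a source moving directly toward or away from a stationary listener can be sketched as follows. This is only an illustration and not the paper's evaluation code: the function name `observed_frequency`, the 343 m/s speed of sound, and the 440 Hz and 20 m/s example values are all assumptions introduced here.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value: dry air at roughly 20 degrees Celsius


def observed_frequency(source_hz: float, source_speed_m_s: float, approaching: bool) -> float:
    """Frequency heard by a stationary listener for a source moving along the line of sight.

    Classical Doppler relation: f_obs = f_src * c / (c - v) when approaching,
    and f_obs = f_src * c / (c + v) when receding.
    """
    if not 0 <= source_speed_m_s < SPEED_OF_SOUND_M_S:
        raise ValueError("source speed must be non-negative and subsonic")
    # Negative sign shortens the wavelength (approach), positive sign stretches it (recede).
    signed_speed = -source_speed_m_s if approaching else source_speed_m_s
    return source_hz * SPEED_OF_SOUND_M_S / (SPEED_OF_SOUND_M_S + signed_speed)


if __name__ == "__main__":
    horn_hz = 440.0  # hypothetical source pitch
    speed = 20.0     # hypothetical vehicle speed in m/s
    print("approaching:", observed_frequency(horn_hz, speed, approaching=True))
    print("receding:   ", observed_frequency(horn_hz, speed, approaching=False))
```

Under these assumptions, a generated clip that respects the effect should show a pitch above the source frequency while the object approaches and below it once the object has passed.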

[65] alternate left-right channels

Special Samples Analysis. 10.1. Veo3 Case Analysis. We examine a case (Fig. 17) where Veo3 autonomously generated stereophonic audio featuring distinct Doppler effects, notably without explicit spatial specifications in the input prompt. We conducted time-domain waveform and spectrogram analyses for both channels, as shown in Fig. 18. The specific prompt u...

[66] "score" (integer 1-5) and "reason"

MLLM-Based Evaluation Cases. 11.1. Macro Evaluation System Prompt Sample. As introduced in the main paper, our evaluation framework leverages Qwen2.5 Omni 7B [55] to provide a scalable and standardized alternative to traditional MOS. This supplementary section provides the specific implementation details for the coarse-grained (macro) evaluation level. ...
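
The excerpt for this entry names two output fields for the macro-level judge, a "score" (integer 1-5) and a "reason". A minimal sketch of how such a reply might be validated is given below. It assumes the reply arrives as a JSON object with exactly those two keys, which the truncated excerpt does not confirm, and the helper name `parse_judgement` is hypothetical.

```python
import json


def parse_judgement(reply: str) -> tuple[int, str]:
    """Validate a judge reply of the assumed form {"score": 1-5, "reason": "..."}."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    score = data.get("score")
    reason = data.get("reason")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return score, reason.strip()


if __name__ == "__main__":
    # Hypothetical reply text, written here only to exercise the parser.
    example = '{"score": 4, "reason": "Footsteps align with the visible steps."}'
    print(parse_judgement(example))
```

Rejecting out-of-range or non-integer scores up front keeps malformed replies from silently skewing an averaged rating.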

[67] "score" (integer 1-5) and "reason"

Narrative function: Does audio actively clarify or enhance the story? Examples include:
- Highlighting a key action (e.g., a heartbeat during a reveal)
- Conveying character perspective (e.g., muffled sound during dazed POV)
- Bridging scenes through sound continuity (e.g., train whistle fading into next location)
- Providing off-screen context (e....
