VABench: A Comprehensive Benchmark for Audio-Video Generation
Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3
The pith
VABench supplies a multi-dimensional benchmark to test synchronized audio-video generation models where prior visual-only tests fall short.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VABench is a benchmark framework that systematically evaluates synchronous audio-video generation models through three primary task types, two major evaluation modules spanning fifteen dimensions that cover pairwise similarities, synchronization, lip-speech consistency, and curated QA pairs, plus seven content categories, establishing quantitative assessment where existing video benchmarks lack audio-video metrics.
What carries the argument
The VABench evaluation framework consisting of fifteen dimensions that measure pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs across seven content categories.
If this is right
- Generation models can receive explicit scores for audio-video alignment rather than visual quality alone.
- Developers gain concrete targets for improving synchronization and lip-speech consistency.
- New models can be compared directly on joint audio-video tasks across text-to-audio-video, image-to-audio-video, and stereo audio-video generation.
- Research focus may shift toward metrics that treat audio and video as a single synchronized output.
- Benchmark results can guide selection of models for applications requiring matched sound and motion.
Where Pith is reading between the lines
- Adoption of VABench could standardize training objectives so that models optimize for both modalities jointly instead of adding audio after video generation.
- The framework may reveal whether current architectures handle environmental sounds or virtual-world scenes better than human-speech scenes.
- Future extensions could add temporal metrics that track synchronization drift over longer sequences.
- Results on VABench might correlate with downstream task performance such as video editing or virtual-reality rendering where audio-video mismatch is noticeable.
- These inferences are editorial extensions and are not stated in the paper.
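One way the drift-tracking extension mentioned in this list could be prototyped is with a windowed cross-correlation between an audio event signal and a video event signal. This is a minimal sketch under stated assumptions, not a metric from the paper: both signals are assumed to be already reduced to 1-D event strengths at the same sampling rate, and the names `best_lag` and `windowed_lags` and the toy data are hypothetical.

```python
def best_lag(audio, video, max_lag):
    """Return the lag (in samples) at which audio best matches video.

    A positive result means the audio event occurs later than the video event.
    """
    n = len(video)

    def score(lag):
        # Correlate video[t] with audio[t + lag] over the overlapping range.
        return sum(
            video[t] * audio[t + lag]
            for t in range(n)
            if 0 <= t + lag < n
        )

    return max(range(-max_lag, max_lag + 1), key=score)


def windowed_lags(audio, video, window, hop, max_lag):
    """Estimate one lag per window so that changes over time become visible."""
    if len(audio) != len(video):
        raise ValueError("signals must share one sampling rate and length")
    return [
        best_lag(audio[s:s + window], video[s:s + window], max_lag)
        for s in range(0, len(video) - window + 1, hop)
    ]


# Toy signals (illustrative only): video events at samples 5 and 25,
# audio events at samples 6 and 28, so the audio falls further behind.
video = [0.0] * 40
audio = [0.0] * 40
video[5] = video[25] = 1.0
audio[6] = audio[28] = 1.0

lags = windowed_lags(audio, video, window=20, hop=20, max_lag=4)
drift = lags[-1] - lags[0]  # growth in lag between first and last window
print(lags, drift)
```

A single global offset would hide this pattern; reporting one lag per window is what makes a growing offset visible.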
Load-bearing premise
That the chosen fifteen dimensions, pairwise similarity checks, lip-speech tests, and seven content categories together capture the essential qualities needed to judge audio-video generation performance.
What would settle it
A generation model that ranks high on all VABench dimensions yet produces clearly mismatched audio and video when tested on real-world clips outside the seven categories.
Original abstract
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VABench, a benchmark framework for evaluating synchronous audio-video generation models. It defines three task types (text-to-audio-video, image-to-audio-video, and stereo audio-video generation), two evaluation modules spanning 15 dimensions (pairwise similarities, audio-video synchronization, lip-speech consistency, and QA pairs), and seven content categories (animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, virtual worlds), with accompanying analysis and visualizations.
Significance. If the 15 dimensions and QA pairs are shown through validation to correlate with human judgments and to capture perceptually relevant failure modes missed by prior video benchmarks, VABench could provide a useful standardized evaluation protocol for audio-video synchronization in generative models. The multi-task and multi-category coverage is a constructive step toward addressing the gap noted in the abstract.
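The validation this paragraph conditions on, agreement between a benchmark dimension and human judgments, is commonly summarized with a rank correlation. The following is a minimal sketch of Spearman's rho using only the standard library. The per-model scores and ratings are hypothetical placeholders, not values from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one automatic score and one mean human rating per model.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_ratings = [3.1, 3.8, 2.9, 4.2, 3.0]
rho = spearman(metric_scores, human_ratings)
print(round(rho, 3))
```

A high rho on held-out samples would support the claim that a dimension tracks perception, while a low one would flag a dimension that needs rework.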
major comments (3)
- [Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.
- [QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.
- [Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.
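For the inter-annotator agreement requested in the QA pairs comment, Fleiss' kappa is one standard statistic when every item is labeled by the same number of annotators. The sketch below implements the textbook formula. The rating table is a hypothetical example, and nothing here describes the paper's actual annotation protocol.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical table: 4 QA items, 3 annotators, 2 labels (valid / invalid).
ratings = [[2, 1], [1, 2], [3, 0], [0, 3]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.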
minor comments (1)
- [Abstract] The abstract states that VABench 'establishes two major evaluation modules' but does not name or briefly characterize those two modules, which would improve immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
- Referee: [Evaluation modules] Evaluation modules (described in the main text following the abstract): the 15 dimensions are enumerated at a high level but no metric formulas, distance functions, or implementation details are supplied for text-video similarity, video-audio sync, lip-speech consistency, or the QA scoring procedure, rendering the benchmark non-reproducible.
  Authors: We agree that additional implementation details are required for reproducibility. In the revised manuscript we will expand the Evaluation Modules section with explicit metric formulas (e.g., CLIP cosine similarity for text-video, SyncNet-based AV sync score, Wav2Lip lip-speech consistency), distance functions, and the precise QA scoring procedure, together with pseudocode and a pointer to the released evaluation code.
  Revision: yes
- Referee: [QA pairs] QA pairs subsection: no inter-annotator agreement, human correlation study, or ablation is reported to confirm that the curated audio and video QA pairs actually measure the intended semantic and perceptual properties.
  Authors: We acknowledge this gap. Although the QA pairs were curated by multiple annotators under a documented protocol, agreement statistics were not reported. We will add an inter-annotator agreement analysis (Fleiss’ kappa) and a human correlation study on a held-out subset in the revised version, placing detailed results in the appendix if space is constrained.
  Revision: yes
- Referee: [Content categories and results analysis] Content categories and results analysis: the claim that the seven categories plus the listed checks suffice to capture key challenges (e.g., temporal audio drift or semantic mismatch) is unsupported by any comparative analysis against existing benchmarks or by evidence that the chosen axes predict human-perceived audio-video quality.
  Authors: We agree that stronger justification is needed. The revision will include a comparative table against existing benchmarks (VBench, AIGC-Vid, etc.) and additional analysis with qualitative examples showing how the seven categories and 15 dimensions specifically surface failure modes such as temporal drift and semantic mismatch. We will also reference supporting human-perception literature where available.
  Revision: yes
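The first response names CLIP cosine similarity as the intended text-video formula. A minimal sketch of that kind of score follows, assuming the text and per-frame embeddings have already been produced by some encoder. Averaging over frames is an assumption made here for illustration, the toy vectors are hypothetical, and `text_video_similarity` is not a function from any released VABench code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def text_video_similarity(text_emb, frame_embs):
    """Mean cosine similarity between one text embedding and each frame.

    Frame averaging is an illustrative choice; other aggregations are possible.
    """
    if not frame_embs:
        raise ValueError("need at least one frame embedding")
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


# Toy 2-D embeddings standing in for real encoder outputs.
text_emb = [1.0, 0.0]
frame_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = text_video_similarity(text_emb, frame_embs)
print(round(score, 4))
```

The same cosine function would apply to the text-audio and video-audio pairs once both sides are embedded in a shared space.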
Circularity Check
No circularity: benchmark defined directly by tasks and dimensions
The paper introduces VABench by enumerating three task types (T2AV, I2AV, stereo) and two evaluation modules spanning 15 dimensions (pairwise similarities, synchronization, lip-speech consistency, QA pairs) plus seven content categories. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The framework is specified by construction as a list of chosen axes rather than derived from prior results or self-citations that reduce to the same inputs. This is the normal case of a definitional benchmark paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard pairwise similarity metrics and curated QA pairs are sufficient to evaluate audio-video synchronization and quality.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  VABench encompasses two primary audio-video generation tasks... 15 fine-grained metrics... pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency...
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We propose VABench... suite of 15 fine-grained metrics... seven major content categories: animals, human sounds, music...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
- PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
  PhyAVBench supplies the first benchmark and contrastive metric that measures whether text-to-audio-video models respect real-world audio physics across controlled prompt pairs.
- MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
  MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...
- Do Joint Audio-Video Generation Models Understand Physics?
  Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
- TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
  TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...
- VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
  VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
- PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
  PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.
- SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
  SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
- OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
  OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
Reference graph
Works this paper leans on
- [1] Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025.
- [2] Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang. Multi-step visual reasoning with visual tokens scaling and verification. arXiv preprint arXiv:2506.07235, 2025.
- [3] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
- [4] Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023.
- [5] Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A versatile video understanding and reasoning model from question answering to captioning tasks. arXiv preprint arXiv:2506.09079, 2025.
- [6] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025.
- [7]
- [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems, pages 8780–8794. Curran Associates, Inc., 2021.
- [9] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: faster and better text-to-image generation via hierarchical transformers. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.
- [10] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,
- [11] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023.
- [12] Alibaba Tongyi Group. Wan 2.5: Unified multi-modal video generation framework, 2025.
- [13] Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [14] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- [15] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023.
- [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022.
- [18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022.
- [19] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- [20] Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024.
- [21] James M. Kates. Signal Processing for Hearing Aids, pages 235–277. Springer US, Boston, MA, 2002.
- [22] Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262, 2024.
- [23] Yunming Liang, Zihao Chen, Chaofan Ding, and Xinhan Di. Deepsound-v1: Start to think step-by-step in the audio generation from videos. arXiv preprint arXiv:2503.22208, 2025.
- [24] Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024.
- [25] Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025.
- [26] Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing. arXiv preprint arXiv:2506.21448, 2025.
- [27] Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377, 2025.
- [28] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024.
- [29] Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. In European Conference on Computer Vision, pages 235–252. Springer, 2024.
- [30] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jinren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320, 3, 2023.
- [31] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
- [32] Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv preprint arXiv:2104.09494, 2021.
- [33]
- [34] OpenAI. Sora 2: Video generation model, 2025.
- [35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,
- [36] Ville Pulkki and Matti Karjalainen. Communication acoustics: an introduction to speech, audio and psychoacoustics. John Wiley & Sons, 2015.
- [37] Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493–6497. IEEE, 2021.
- [38] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
- [39] Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation. arXiv preprint arXiv:2508.16930, 2025.
- [40] Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, et al. Mavors: Multi-granularity video representation for multimodal large language model. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10994–11003, 2025.
- [41] Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, et al. Mme-videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios. arXiv preprint arXiv:2505.21333, 2025.
- [42] John William Strutt. On the perception of the direction of sound. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 83(559):61–64, 1909.
- [43]
- [44] Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. Peaq-the itu standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000.
- [45] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139, 2025.
- [46] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
- [47] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [48] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- [49] Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, et al. Kling-foley: Multimodal diffusion transformer for high-quality video-to-audio generation. arXiv preprint arXiv:2506.19774, 2025.
- [50] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
- [51] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025.
- [52] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023.
- [53] Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025.
- [54] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- [55] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025.
- [56] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [57] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.
- [58] Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation agent: Efficient and promptable evaluation framework for visual generative models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7561–7582, 2025.
- [59] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.
- [60] Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, page 1–22, Berlin, Heidelberg, 2...
- [61] Additional evaluation metrics. 6.1. Supplementary results analysis in SpeechClarity and Artistry.
  Table 3. Supplementary results for T2AV and I2AV.
  Models         T2AV SpeechClarity   T2AV Artistry   I2AV Artistry
  sora2          2.367                3.735           3.931
  veo3           2.554                3.825           3.983
  wan2.5         2.396                3.838           3.929
  seed think     2.008                3.717           3.956
  seed mm        2.202                3.707           3.971
  wan2.2 think   1.882                3.630           3.942
  wan2...
- [62] Audio-Video Generation Models in Evaluation. In our experiments, we adhered to the default configuration parameters provided by each video generation model, as summarized in Tab....
  Table 4. The attributes of the videos generated by each model.
  Models              Length   FPS
  sora2               10s      30
  veo3                8s       24
  wan2.5              5s       24
  seedance 1.0 lite   5s       24
  wan2.2              5s       24
  kling2.5 turbo      5s       24
- [63] Detail Analysis of Different Tasks. This section provides a comprehensive analysis of experimental results across different categories for various models under both T2AV (Tab. 5) and I2AV (Tab. 6) tasks. The study aims to identify common patterns across tasks and elucidate the specific impact of image-conditioned input (I2AV) on the final outcomes. ...
- [64] Qualitative Analysis. In this section, we conduct a more detailed analysis based on several specific scenarios. These scenarios are selected to examine how the models handle challenging multimodal cues involving physical principles, temporal constraints, and spatial structures. 9.1. Doppler Effect. This part evaluates whether the models can generate acous- ...
- [65] Special samples Analysis. 10.1. Veo3 Case Analysis. We examine a case (Fig. 17) where Veo3 autonomously generated stereophonic audio featuring distinct Doppler effects, notably without explicit spatial specifications in the input prompt. We conducted time-domain waveform and spectrogram analyses for both channels, as shown in Fig. 18. The specific prompt u...
- [66] MLLM Based Evaluation Cases. 11.1. Macro Evaluation System Prompt Sample. As introduced in the main paper, our evaluation framework leverages Qwen2.5 Omni 7B [55] to provide a scalable and standardized alternative to traditional MOS. This supplementary section provides the specific implementation details for the coarse-grained (macro) evaluation level. ...
- [67] Narrative function: Does audio actively clarify or enhance the story? Examples include: - Highlighting a key action (e.g., a heartbeat during a reveal) - Conveying character perspective (e.g., muffled sound during dazed POV) - Bridging scenes through sound continuity (e.g., train whistle fading into next location) - Providing off-screen context (e....