arxiv: 2512.00336 · v3 · submitted 2025-11-29 · 💻 cs.CV

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

Mengxue Hu, Yunfeng Diao, Changtao Miao, Zhiqing Guo, Jianshu Li, Zhe Li, Joey Tianyi Zhou

Pith reviewed 2026-05-17 03:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated content · multimodal dataset · video-audio detection · forgery patterns · benchmark dataset · deepfake detection · content authenticity · generative models


The pith

The MVAD dataset is the first comprehensive benchmark for detecting AI-generated multimodal video-audio content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the Multimodal Video-Audio Dataset (MVAD) to address the shortage of resources for identifying AI-generated videos that include synchronized audio. Existing datasets focus mainly on visuals alone or are restricted to facial deepfakes, leaving a gap as AI tools create more general multimodal content. The new dataset supplies samples from three realistic video-audio forgery patterns, produced with various state-of-the-art models for high quality, and spans realistic and anime styles plus categories such as humans, animals, objects, and scenes. A reader would care because improved detection tools are needed to check content authenticity when both video and audio are involved. If the dataset works as intended, it supports development of systems that verify multimodal fakes more effectively than current options allow.

Core claim

We introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: genuine multimodality with samples generated according to three realistic video-audio forgery patterns; high perceptual quality achieved through diverse state-of-the-art generative models; and comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types.

What carries the argument

The MVAD dataset, which supplies benchmark examples built from three realistic video-audio forgery patterns together with broad coverage of styles, categories, and data types.

If this is right

Detection systems can now train and test on paired video and audio forgeries rather than visuals alone.
Development of trustworthy multimodal detection tools is supported beyond the limits of facial-deepfake datasets.
Evaluation becomes possible across realistic and anime styles as well as four content categories.
High-quality examples from multiple generative models allow comparison of detector robustness.
Four distinct multimodal data types provide varied training cases for audio-video consistency checks.
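
As an illustration of the stratified evaluation this list describes, the sketch below groups detector predictions by visual style and content category. The record keys (`style`, `category`, `label`, `pred`) and the toy records are assumptions made for this example; the paper does not specify a manifest format.

```python
from collections import defaultdict

# Styles and categories named in the abstract.
STYLES = ("realistic", "anime")
CATEGORIES = ("humans", "animals", "objects", "scenes")


def accuracy_by_stratum(records):
    """Return detector accuracy for each (style, category) pair.

    Each record is a dict with hypothetical keys:
    'style', 'category', 'label' (1 = AI-generated, 0 = real), 'pred'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["style"], r["category"])
        totals[key] += 1
        hits[key] += int(r["pred"] == r["label"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy records, invented for illustration only.
toy_records = [
    {"style": "realistic", "category": "humans", "label": 1, "pred": 1},
    {"style": "realistic", "category": "humans", "label": 0, "pred": 1},
    {"style": "anime", "category": "scenes", "label": 1, "pred": 1},
]
report = accuracy_by_stratum(toy_records)
```

Reporting per-stratum numbers rather than one pooled score makes it visible when a detector does well on one style or category and poorly on another.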

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could be updated later with new forgery techniques as generative models continue to improve.
Researchers might use MVAD to measure how well detectors catch mismatches between generated audio and video.
Similar benchmark construction could be applied to other paired modalities such as image-text or video-text combinations.
Media platforms could incorporate MVAD-style data to build practical authenticity checks for user-uploaded multimodal content.

Load-bearing premise

The three chosen video-audio forgery patterns plus the selected range of styles and categories are sufficient to stand for the wider and still-expanding set of multimodal AI-generated content.

What would settle it

A detector trained only on MVAD samples is tested on new AI-generated video-audio pairs that use forgery methods or styles outside the three defined patterns and categories; poor performance would indicate the dataset does not cover the necessary range.
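
The test described above can be sketched as a comparison of detector AUC on held-in samples against AUC on samples from an unseen forgery method. This is a minimal sketch under stated assumptions: the scores and labels are toy values, and `auc_gap` is a hypothetical helper, not part of any MVAD tooling.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap(in_dist, out_dist):
    """In-distribution AUC minus out-of-distribution AUC.

    Each argument is a (scores, labels) pair; label 1 = AI-generated.
    """
    return roc_auc(*in_dist) - roc_auc(*out_dist)


# Toy detector outputs, invented for illustration only.
in_dist = ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])    # patterns seen in training
out_dist = ([0.6, 0.4, 0.5, 0.3], [1, 1, 0, 0])   # unseen forgery method
gap = auc_gap(in_dist, out_dist)
```

A large positive gap would suggest that the training patterns do not cover the unseen method; a gap near zero would be consistent with adequate coverage, though it would not prove it.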

Figures

Figures reproduced from arXiv: 2512.00336 by Changtao Miao, Jianshu Li, Joey Tianyi Zhou, Mengxue Hu, Yunfeng Diao, Zhe Li, Zhiqing Guo.

**Figure 1.** MVAD represents the first general-purpose dataset specifically designed for detecting AI-generated multimodal video-audio content, addressing a critical gap in current research.

**Figure 2.** Construction pipeline of MVAD, comprising: data collection from open sources and self-synthesized content generation; multi-stage data generation implementing three distinct forgery patterns; and comprehensive evaluation through automated metrics, LMM assessment, and human expert verification.

Original abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MVAD is a dataset paper that targets a real gap in multimodal video-audio deepfake benchmarks but rests its 'comprehensive' claim on three unspecified forgery patterns whose coverage is not yet shown.


The main point for you is that this work releases MVAD, a new benchmark meant to cover AI-generated video with audio beyond the usual visual-only or face-focused sets. It claims three realistic forgery patterns, SOTA generators for quality, and diversity across realistic/anime styles, four content categories, and four data types, with a GitHub release planned. That directly addresses the limitation the abstract describes in prior datasets and gives the community something broader to test detectors against as generation improves.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Multimodal Video-Audio Dataset (MVAD) as the first comprehensive benchmark for detecting AI-generated multimodal video-audio content. It claims three key characteristics: genuine multimodality via three realistic video-audio forgery patterns, high perceptual quality from diverse state-of-the-art generative models, and broad diversity across realistic/anime styles, four content categories (humans, animals, objects, scenes), and four video-audio data types. The dataset is planned for public release via GitHub.

Significance. If the construction details and coverage claims hold, MVAD would fill a clear gap left by visual-only synthetic video datasets and facial-deepfake audio datasets, enabling more general multimodal detection research. Public release of a dataset with planned diversity is a constructive contribution to multimedia forensics.

major comments (2)

[Abstract and Section 3 (Dataset Construction)] The central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.
[Section 4 (Quality and Validation)] No quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.

minor comments (2)

[Abstract] The four video-audio multimodal data types are referenced but never enumerated or defined, reducing clarity on the exact composition of the dataset.
Figure captions and table headers could more explicitly link each sample to its forgery pattern and generator model for easier reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to strengthen the presentation of MVAD's novelty and validation. We address each major comment below and will revise the manuscript to incorporate additional details and quantitative support.


Referee: [Abstract and Section 3 (Dataset Construction)] The central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.

Authors: We agree that the manuscript would benefit from greater explicitness on this point. Section 3 describes the three forgery patterns at a high level but does not enumerate them in detail or compare them to alternative generation pipelines. In the revised version we will add: (1) an explicit enumeration of the three patterns with concrete examples, (2) a short comparison to joint video-audio diffusion models versus separate synthesis followed by alignment, and (3) a brief justification for the choice of the four content categories together with evidence that they cover the main semantic domains encountered in real-world multimodal content. These additions will make the representativeness argument verifiable without altering the core claims.

Revision: yes

Referee: [Section 4 (Quality and Validation)] No quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.

Authors: We acknowledge that the current Section 4 relies primarily on qualitative descriptions. In the revised manuscript we will report quantitative perceptual quality scores (e.g., FID and LPIPS computed against real reference videos), forgery realism measures where applicable, and results from a controlled user study that includes inter-rater agreement statistics. We will also expand the construction details to document the specific state-of-the-art models used, post-processing steps, and quality-control procedures. These additions directly address the load-bearing nature of the quality claim.

Revision: yes
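
The response does not say which inter-rater agreement statistic the user study would report. As one possibility, the sketch below computes Cohen's kappa for two raters who each label the same clips as real or fake. The choice of statistic, the label names, and the toy ratings are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings, invented for illustration only.
ratings_a = ["real", "fake", "fake", "real"]
ratings_b = ["real", "fake", "real", "real"]
kappa = cohen_kappa(ratings_a, ratings_b)
```

With more than two raters, a statistic such as Fleiss' kappa would be the usual alternative; kappa is preferred over raw agreement because it corrects for agreement that would occur by chance.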

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations or fitted predictions


The paper introduces MVAD as a new benchmark dataset constructed from existing generative models and three defined forgery patterns, without any equations, parameter fitting, predictive modeling, or derivation chains. The central claims rest on descriptive choices about multimodality, quality, and diversity rather than any self-referential reduction where an output is defined in terms of itself or a fitted input is relabeled as a prediction. No self-citations serve as load-bearing uniqueness theorems, and the representativeness of the patterns is presented as an empirical design decision open to external scrutiny, not a circular step. This is a standard honest dataset paper with self-contained construction details.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the creation and release of a new dataset using existing generative models; no free parameters are fitted and no new entities are postulated.

axioms (1)

Domain assumption: The three realistic video-audio forgery patterns adequately capture the main challenges in multimodal AI-generated content. Invoked when describing the dataset's key characteristics and motivation.

pith-pipeline@v0.9.0 · 5499 in / 1126 out tokens · 44171 ms · 2026-05-17T03:44:48.958079+00:00 · methodology

