pith. machine review for the scientific record.

arxiv: 2512.00336 · v3 · submitted 2025-11-29 · 💻 cs.CV

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

Pith reviewed 2026-05-17 03:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated content, multimodal dataset, video-audio detection, forgery patterns, benchmark dataset, deepfake detection, content authenticity, generative models

The pith

The MVAD dataset is the first comprehensive benchmark for detecting AI-generated multimodal video-audio content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the Multimodal Video-Audio Dataset (MVAD) to address the shortage of resources for identifying AI-generated videos that include synchronized audio. Existing datasets focus mainly on visuals alone or are restricted to facial deepfakes, leaving a gap as AI tools create more general multimodal content. The new dataset supplies samples from three realistic video-audio forgery patterns, produced with various state-of-the-art models for high quality, and spans realistic and anime styles plus categories such as humans, animals, objects, and scenes. A reader would care because improved detection tools are needed to check content authenticity when both video and audio are involved. If the dataset works as intended, it supports development of systems that verify multimodal fakes more effectively than current options allow.

Core claim

We introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: genuine multimodality with samples generated according to three realistic video-audio forgery patterns; high perceptual quality achieved through diverse state-of-the-art generative models; and comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types.
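The diversity claim can be checked mechanically once each sample carries style and category labels. The sketch below is a minimal illustration, not the authors' tooling: the `Sample` record, its field names, and the toy sample list are hypothetical. Only the two styles and four content categories come from the claim above.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Label vocabularies taken from the paper's stated diversity axes.
STYLES = ("realistic", "anime")
CATEGORIES = ("humans", "animals", "objects", "scenes")


@dataclass(frozen=True)
class Sample:
    """Hypothetical manifest record; the field names are assumptions."""
    sample_id: str
    style: str
    category: str


def coverage_grid(samples):
    """Count samples in every (style, category) cell, including empty ones."""
    counts = Counter((s.style, s.category) for s in samples)
    return {cell: counts.get(cell, 0) for cell in product(STYLES, CATEGORIES)}


def empty_cells(samples):
    """Return the (style, category) cells that contain no samples."""
    return [cell for cell, n in coverage_grid(samples).items() if n == 0]


# Toy data for illustration only; not drawn from MVAD.
toy = [
    Sample("a", "realistic", "humans"),
    Sample("b", "realistic", "animals"),
    Sample("c", "anime", "humans"),
]
grid = coverage_grid(toy)
missing = empty_cells(toy)
```

Samples whose labels fall outside the two vocabularies are ignored by the grid, so a real check would likely validate labels first. The same grid could be extended with the forgery pattern or data type as further axes once those are enumerated.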

What carries the argument

The MVAD dataset, which supplies benchmark examples built from three realistic video-audio forgery patterns together with broad coverage of styles, categories, and data types.

If this is right

  • Detection systems can now train and test on paired video and audio forgeries rather than visuals alone.
  • Development of trustworthy multimodal detection tools is supported beyond the limits of facial-deepfake datasets.
  • Evaluation becomes possible across realistic and anime styles as well as four content categories.
  • High-quality examples from multiple generative models allow comparison of detector robustness.
  • Four distinct multimodal data types provide varied training cases for audio-video consistency checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could be updated later with new forgery techniques as generative models continue to improve.
  • Researchers might use MVAD to measure how well detectors catch mismatches between generated audio and video.
  • Similar benchmark construction could be applied to other paired modalities such as image-text or video-text combinations.
  • Media platforms could incorporate MVAD-style data to build practical authenticity checks for user-uploaded multimodal content.

Load-bearing premise

The three chosen video-audio forgery patterns plus the selected range of styles and categories are sufficient to stand for the wider and still-expanding set of multimodal AI-generated content.

What would settle it

A detector trained only on MVAD samples is tested on new AI-generated video-audio pairs that use forgery methods or styles outside the three defined patterns and categories; poor performance would indicate the dataset does not cover the necessary range.
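This test reduces to comparing accuracy on held-in samples with accuracy on held-out forgery types. The following is a minimal sketch, assuming binary labels (1 = AI-generated, 0 = real) and a detector that emits hard predictions. The function names and the toy numbers are hypothetical, and no result here comes from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def generalization_gap(in_dist, out_dist):
    """In-distribution accuracy minus out-of-distribution accuracy.

    Each argument is a (y_true, y_pred) pair. A large positive gap would
    suggest the training data does not cover the held-out forgery types.
    """
    return accuracy(*in_dist) - accuracy(*out_dist)


# Toy labels and predictions for illustration only (1 = generated, 0 = real).
seen = ([1, 1, 0, 0], [1, 1, 0, 0])    # all four correct
unseen = ([1, 1, 0, 0], [0, 1, 0, 1])  # two of four correct
gap = generalization_gap(seen, unseen)
```

Plain accuracy is used only to keep the sketch short. On imbalanced test sets a threshold-free metric such as AUC would usually be more informative.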

Figures

Figures reproduced from arXiv: 2512.00336 by Changtao Miao, Jianshu Li, Joey Tianyi Zhou, Mengxue Hu, Yunfeng Diao, Zhe Li, Zhiqing Guo.

Figure 1
Figure 1: MVAD represents the first general-purpose dataset specifically designed for detecting AI-generated multimodal video-audio content, addressing a critical gap in current research.
Figure 2
Figure 2: Construction pipeline of MVAD, comprising: data collection from open sources and self-synthesized content generation; multi-stage data generation implementing three distinct forgery patterns; and comprehensive evaluation through automated metrics, LMM assessment, and human expert verification.

Original abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes, a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Multimodal Video-Audio Dataset (MVAD) as the first comprehensive benchmark for detecting AI-generated multimodal video-audio content. It claims three key characteristics: genuine multimodality via three realistic video-audio forgery patterns, high perceptual quality from diverse state-of-the-art generative models, and broad diversity across realistic/anime styles, four content categories (humans, animals, objects, scenes), and four video-audio data types. The dataset is planned for public release via GitHub.

Significance. If the construction details and coverage claims hold, MVAD would fill a clear gap left by visual-only synthetic video datasets and facial-deepfake audio datasets, enabling more general multimodal detection research. Public release of a dataset with planned diversity is a constructive contribution to multimedia forensics.

major comments (2)
  1. [Abstract and Section 3] Abstract and Section 3 (Dataset Construction): the central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.
  2. [Section 4] Section 4 (Quality and Validation): no quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.
minor comments (2)
  1. [Abstract] The four video-audio multimodal data types are referenced but never enumerated or defined, reducing clarity on the exact composition of the dataset.
  2. Figure captions and table headers could more explicitly link each sample to its forgery pattern and generator model for easier reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to strengthen the presentation of MVAD's novelty and validation. We address each major comment below and have revised the manuscript to incorporate additional details and quantitative support.

Point-by-point responses
  1. Referee: [Abstract and Section 3] Abstract and Section 3 (Dataset Construction): the central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.

    Authors: We agree that the manuscript would benefit from greater explicitness on this point. Section 3 describes the three forgery patterns at a high level but does not enumerate them in detail or compare them to alternative generation pipelines. In the revised version we will add: (1) an explicit enumeration of the three patterns with concrete examples, (2) a short comparison to joint video-audio diffusion models versus separate synthesis followed by alignment, and (3) a brief justification for the choice of the four content categories together with evidence that they cover the main semantic domains encountered in real-world multimodal content. These additions will make the representativeness argument verifiable without altering the core claims. revision: yes

  2. Referee: [Section 4] Section 4 (Quality and Validation): no quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.

    Authors: We acknowledge that the current Section 4 relies primarily on qualitative descriptions. In the revised manuscript we will report quantitative perceptual quality scores (e.g., FID and LPIPS computed against real reference videos), forgery realism measures where applicable, and results from a controlled user study that includes inter-rater agreement statistics. We will also expand the construction details to document the specific state-of-the-art models used, post-processing steps, and quality-control procedures. These additions directly address the load-bearing nature of the quality claim. revision: yes
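The rebuttal does not say which inter-rater agreement statistic would be reported. As one common choice for two raters, Cohen's kappa compares observed agreement p_o with chance agreement p_e via (p_o - p_e) / (1 - p_e). The sketch below is an assumption-laden illustration: the function name and the toy "real"/"fake" ratings are hypothetical and do not come from the paper's user study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy ratings for illustration only: "real" vs "fake" judgments.
a = ["fake", "fake", "real", "real"]
b = ["fake", "real", "real", "real"]
kappa = cohens_kappa(a, b)
```

With more than two raters, a generalization such as Fleiss' kappa would be the usual substitute.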

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations or fitted predictions

full rationale

The paper introduces MVAD as a new benchmark dataset constructed from existing generative models and three defined forgery patterns, without any equations, parameter fitting, predictive modeling, or derivation chains. The central claims rest on descriptive choices about multimodality, quality, and diversity rather than any self-referential reduction where an output is defined in terms of itself or a fitted input is relabeled as a prediction. No self-citations serve as load-bearing uniqueness theorems, and the representativeness of the patterns is presented as an empirical design decision open to external scrutiny, not a circular step. This is a standard honest dataset paper with self-contained construction details.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the creation and release of a new dataset using existing generative models; no free parameters are fitted and no new entities are postulated.

axioms (1)
  • domain assumption: The three realistic video-audio forgery patterns adequately capture the main challenges in multimodal AI-generated content.
    Invoked when describing the dataset's key characteristics and motivation.

pith-pipeline@v0.9.0 · 5499 in / 1126 out tokens · 44171 ms · 2026-05-17T03:44:48.958079+00:00 · methodology

