MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection
Pith reviewed 2026-05-17 03:44 UTC · model grok-4.3
The pith
The MVAD dataset is the first comprehensive benchmark for detecting AI-generated multimodal video-audio content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: genuine multimodality with samples generated according to three realistic video-audio forgery patterns; high perceptual quality achieved through diverse state-of-the-art generative models; and comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types.
What carries the argument
The MVAD dataset, which supplies benchmark examples built from three realistic video-audio forgery patterns together with broad coverage of styles, categories, and data types.
If this is right
- Detection systems can now train and test on paired video and audio forgeries rather than visuals alone.
- Development of trustworthy multimodal detection tools is supported beyond the limits of facial-deepfake datasets.
- Evaluation becomes possible across realistic and anime styles as well as four content categories.
- High-quality examples from multiple generative models allow comparison of detector robustness.
- Four distinct multimodal data types provide varied training cases for audio-video consistency checks.
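One way to use the style and category labels is a per-cell accuracy breakdown. The sketch below is a minimal illustration, not code from the paper. The record fields (`style`, `category`, `label`, `pred`) and both function names are hypothetical; only the two style names and the four category names are taken from the abstract.

```python
from collections import defaultdict

# Style and category names come from the abstract; the rest is assumed.
STYLES = ("realistic", "anime")
CATEGORIES = ("humans", "animals", "objects", "scenes")


def accuracy_by_stratum(records):
    """Return {(style, category): accuracy} for a list of prediction records.

    Each record is a dict with keys 'style', 'category', 'label', 'pred',
    where label and pred are 1 for AI-generated and 0 for real (assumed schema).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["style"], r["category"])
        totals[key] += 1
        hits[key] += int(r["label"] == r["pred"])
    return {key: hits[key] / totals[key] for key in totals}


def missing_strata(records):
    """List the (style, category) cells that have no samples at all."""
    seen = {(r["style"], r["category"]) for r in records}
    return [(s, c) for s in STYLES for c in CATEGORIES if (s, c) not in seen]
```

Reporting empty cells alongside the accuracies keeps a high average score from hiding a style-category combination that was never tested.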
Where Pith is reading between the lines
- The dataset could be updated later with new forgery techniques as generative models continue to improve.
- Researchers might use MVAD to measure how well detectors catch mismatches between generated audio and video.
- Similar benchmark construction could be applied to other paired modalities such as image-text or video-text combinations.
- Media platforms could incorporate MVAD-style data to build practical authenticity checks for user-uploaded multimodal content.
Load-bearing premise
The three chosen video-audio forgery patterns plus the selected range of styles and categories are sufficient to stand for the wider and still-expanding set of multimodal AI-generated content.
What would settle it
A detector trained only on MVAD samples is tested on new AI-generated video-audio pairs that use forgery methods or styles outside the three defined patterns and categories; poor performance would indicate the dataset does not cover the necessary range.
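The test described above can be summarized as a gap between in-distribution and out-of-distribution scores. The following is a minimal sketch under stated assumptions: the detector outputs a score where higher means "more likely AI-generated", labels are 1 for generated and 0 for real, and ROC AUC is used as the metric. The paper does not specify a metric, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison.

    Equals the probability that a randomly chosen generated sample (label 1)
    scores higher than a randomly chosen real sample (label 0); ties count half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def generalization_gap(in_dist, out_dist):
    """Return AUC(in_dist) - AUC(out_dist); each argument is (labels, scores)."""
    return roc_auc(*in_dist) - roc_auc(*out_dist)
```

A large positive gap on pairs built with forgery methods or styles outside the dataset would be the "poor performance" signal; how large counts as poor is a judgment the source leaves open.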
Original abstract
The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes, a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Multimodal Video-Audio Dataset (MVAD) as the first comprehensive benchmark for detecting AI-generated multimodal video-audio content. It claims three key characteristics: genuine multimodality via three realistic video-audio forgery patterns, high perceptual quality from diverse state-of-the-art generative models, and broad diversity across realistic/anime styles, four content categories (humans, animals, objects, scenes), and four video-audio data types. The dataset is planned for public release via GitHub.
Significance. If the construction details and coverage claims hold, MVAD would fill a clear gap left by visual-only synthetic video datasets and facial-deepfake audio datasets, enabling more general multimodal detection research. Public release of a dataset with planned diversity is a constructive contribution to multimedia forensics.
major comments (2)
- [Abstract and Section 3] Abstract and Section 3 (Dataset Construction): the central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.
- [Section 4] Section 4 (Quality and Validation): no quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.
minor comments (2)
- [Abstract] The four video-audio multimodal data types are referenced but never enumerated or defined, reducing clarity on the exact composition of the dataset.
- Figure captions and table headers could more explicitly link each sample to its forgery pattern and generator model for easier reproducibility.
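The traceability asked for in the last comment could take the form of a per-sample manifest. The sketch below is hypothetical and does not describe the dataset's actual release format: the `SampleRecord` fields, the CSV layout, and the helper names are all assumptions. Because the source does not enumerate the forgery patterns, `forgery_pattern` is kept as an opaque identifier.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class SampleRecord:
    """One manifest row (assumed schema, not the paper's)."""
    sample_id: str
    forgery_pattern: str  # opaque ID; the patterns are not enumerated in the source
    generator: str        # name of the generative model that produced the sample
    style: str            # "realistic" or "anime"
    category: str         # "humans", "animals", "objects", or "scenes"


def write_manifest(records, handle):
    """Write records as CSV with a header row to an open text handle."""
    names = [f.name for f in fields(SampleRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def read_manifest(handle):
    """Read a CSV manifest back into SampleRecord objects."""
    return [SampleRecord(**row) for row in csv.DictReader(handle)]
```

With a manifest of this kind, a figure caption or table header only needs to cite a `sample_id`, and the pattern and generator follow from the lookup.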
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how to strengthen the presentation of MVAD's novelty and validation. We address each major comment below and have revised the manuscript to incorporate additional details and quantitative support.
- Referee: [Abstract and Section 3] Abstract and Section 3 (Dataset Construction): the central claim that MVAD is 'the first comprehensive dataset' for general multimodal AI-generated content rests on the three unspecified forgery patterns being representative; however, no enumeration, comparison to joint video-audio diffusion versus separate synthesis-plus-alignment, or argument against selection bias in the four categories is provided, leaving the representativeness unverified.
Authors: We agree that the manuscript would benefit from greater explicitness on this point. Section 3 describes the three forgery patterns at a high level but does not enumerate them in detail or compare them to alternative generation pipelines. In the revised version we will add: (1) an explicit enumeration of the three patterns with concrete examples, (2) a short comparison to joint video-audio diffusion models versus separate synthesis followed by alignment, and (3) a brief justification for the choice of the four content categories together with evidence that they cover the main semantic domains encountered in real-world multimodal content. These additions will make the representativeness argument verifiable without altering the core claims. revision: yes
- Referee: [Section 4] Section 4 (Quality and Validation): no quantitative metrics (e.g., perceptual quality scores, forgery realism measures, or inter-rater agreement) or construction details are reported to support the 'high perceptual quality' and 'realistic' characteristics, which is load-bearing for the dataset's utility as a benchmark.
Authors: We acknowledge that the current Section 4 relies primarily on qualitative descriptions. In the revised manuscript we will report quantitative perceptual quality scores (e.g., FID and LPIPS computed against real reference videos), forgery realism measures where applicable, and results from a controlled user study that includes inter-rater agreement statistics. We will also expand the construction details to document the specific state-of-the-art models used, post-processing steps, and quality-control procedures. These additions directly address the load-bearing nature of the quality claim. revision: yes
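Inter-rater agreement, as mentioned in the response above, is often reported with a chance-corrected statistic. The following minimal sketch computes Cohen's kappa for two raters. The rebuttal does not say which statistic the user study would use, so this choice, the label strings, and the function name are assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate
    and p_e is the agreement expected by chance from each rater's label counts.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For more than two raters, a statistic such as Fleiss' kappa would be the usual substitute; the source does not indicate how many raters are planned.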
Circularity Check
No circularity: dataset construction paper with no derivations or fitted predictions
The paper introduces MVAD as a new benchmark dataset constructed from existing generative models and three defined forgery patterns, without any equations, parameter fitting, predictive modeling, or derivation chains. The central claims rest on descriptive choices about multimodality, quality, and diversity rather than any self-referential reduction where an output is defined in terms of itself or a fitted input is relabeled as a prediction. No self-citations serve as load-bearing uniqueness theorems, and the representativeness of the patterns is presented as an empirical design decision open to external scrutiny, not a circular step. This is a standard honest dataset paper with self-contained construction details.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three realistic video-audio forgery patterns adequately capture the main challenges in multimodal AI-generated content.