Pith · machine review for the scientific record

arxiv: 2604.10127 · v1 · submitted 2026-04-11 · 💻 cs.CV · cs.AI


VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation


Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: video generation evaluation · aesthetic quality · benchmark dataset · neural assessors · AIGC · multi-task learning · human alignment

The pith

VGA-Bench offers a taxonomy-driven benchmark and three neural assessors that align with human judgments for evaluating both aesthetic appeal and technical quality in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VGA-Bench, a benchmark that assesses generated video along aesthetic quality, aesthetic tagging, and generation quality using a structured three-tier taxonomy. The authors generate over 60,000 videos from 1,016 prompts across 12 models, human-annotate a subset, and use it to train three multi-task networks: VAQA-Net for quality prediction, VTag-Net for tagging, and VGQA-Net for generation attributes. Experiments indicate that these assessors match human opinions closely enough, and run cheaply enough, to be used at scale. This matters because video generation has advanced quickly but still lacks tools that measure perceptual and artistic qualities beyond basic fidelity.

Core claim

VGA-Bench is a unified benchmark built on a three-tier taxonomy covering Aesthetic Quality, Aesthetic Tagging, and Generation Quality. The authors create 1,016 prompts and over 60,000 videos using 12 models, human-label a subset, and train three dedicated neural assessors that achieve reliable alignment with human judgments while remaining efficient enough for automated, large-scale evaluation.

What carries the argument

The three multi-task neural assessors—VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation quality attributes—trained on the human-annotated dataset derived from the taxonomy.
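The paper's assessor internals are not reproduced on this page, but each network follows the shared-backbone, multi-head pattern common to multi-task quality models. The sketch below illustrates that generic pattern only; the projection layer, feature size, and the numbers of sub-dimensions, tags, and generation attributes are illustrative assumptions, not the architecture shown in Figure 4.

```python
# Generic shared-backbone, multi-head video assessor in the spirit of
# VAQA-Net / VTag-Net / VGQA-Net. All dimensions and layer choices here
# are illustrative assumptions, not the paper's specification.
import torch
import torch.nn as nn


class MultiTaskVideoAssessor(nn.Module):
    def __init__(self, feat_dim=768, n_quality_dims=4, n_tags=32, n_gen_attrs=6):
        super().__init__()
        # Stand-in for a pretrained video encoder that pools each clip
        # into a single feature vector before the task-specific heads.
        self.project = nn.Sequential(nn.LazyLinear(feat_dim), nn.GELU())
        self.quality_head = nn.Linear(feat_dim, n_quality_dims)  # score regression
        self.tag_head = nn.Linear(feat_dim, n_tags)              # multi-label logits
        self.gen_head = nn.Linear(feat_dim, n_gen_attrs)         # generation attributes

    def forward(self, clip_features):
        # clip_features: (batch, d) pooled per-clip features from a video backbone.
        h = self.project(clip_features)
        return {
            "quality": self.quality_head(h),   # per-dimension scores (MSE/L1 targets)
            "tags": self.tag_head(h),          # logits; sigmoid/BCEWithLogitsLoss for tagging
            "generation": self.gen_head(h),    # generation-quality attribute predictions
        }


if __name__ == "__main__":
    model = MultiTaskVideoAssessor()
    feats = torch.randn(8, 1024)  # 8 clips, 1024-d pooled features
    outputs = model(feats)
    print({k: tuple(v.shape) for k, v in outputs.items()})
```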

Load-bearing premise

The human-labeled subset is representative enough for the three neural assessors to generalize reliably to unseen prompts, models, and video styles.

What would settle it

Applying the trained assessors to a new set of videos generated by models outside the original 12, collecting independent human ratings on the same taxonomy dimensions, and observing whether the predicted scores correlate highly with the new human labels.
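A minimal sketch of that check, assuming the new videos have already been scored by a trained assessor and independently rated by human annotators; the function, array names, and toy numbers are illustrative, not from the paper.

```python
# Sketch of the proposed falsification test: score videos from held-out
# generators, collect independent human ratings on the same taxonomy
# dimension, and check linear and rank agreement plus absolute error.
import numpy as np
from scipy.stats import pearsonr, spearmanr


def alignment_report(predicted_scores, human_mos):
    """Compare assessor predictions with human mean-opinion scores."""
    predicted_scores = np.asarray(predicted_scores, dtype=float)
    human_mos = np.asarray(human_mos, dtype=float)

    plcc, _ = pearsonr(predicted_scores, human_mos)   # linear correlation
    srcc, _ = spearmanr(predicted_scores, human_mos)  # rank correlation
    mae = float(np.mean(np.abs(predicted_scores - human_mos)))
    return {"PLCC": plcc, "SRCC": srcc, "MAE": mae}


# Toy numbers; real use would load per-video scores for videos generated
# by models outside the original 12.
preds = [3.1, 4.0, 2.2, 4.6, 3.8]
mos = [3.0, 4.2, 2.5, 4.4, 3.9]
print(alignment_report(preds, mos))
```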

Figures

Figures reproduced from arXiv: 2604.10127 by Bao Peng, Dandan Zheng, Heng Huang, Huaye Wang, Jingdong Chen, Jun Zhou, Longteng Jiang, Qianqian Qiao, Xin Jin, Yihang Bo.

Figure 1: Overview of VGA-Bench, a unified benchmark and multi-model framework for video aesthetic and generation quality evaluation.
Figure 3: Examples of human annotations for the three core dimensions.
Figure 4: Architecture of (1) VAQA-Net, (2) VTag-Net, and (3) VGQA-Net.
Figure 5: Comparison of generated videos from different models.
Figure 6: Radar chart comparing the performance of various video generation models across three evaluation dimensions.
Original abstract

The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality in AIGC. It defines a three-tier taxonomy (Aesthetic Quality, Aesthetic Tagging, Generation Quality) with fine-grained sub-dimensions, generates over 60,000 videos from 1,016 diverse prompts using 12 video generation models, human-annotates a subset of the data, and trains three multi-task neural assessors (VAQA-Net for aesthetic quality prediction, VTag-Net for aesthetic tagging, and VGQA-Net for generation quality attributes). The central claim is that extensive experiments show these models achieve reliable alignment with human judgments while providing accuracy and efficiency; the benchmark and models are released publicly.

Significance. If the human alignment claims are substantiated with proper quantitative validation, this work would provide a valuable large-scale resource and automated tools to address the gap between technical fidelity metrics and perceptual/artistic qualities in video generation evaluation. The public release of the 60k-video dataset, prompts, and trained assessors is a clear strength that could support applications in content moderation, model debugging, and generative optimization.

major comments (2)
  1. [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.
  2. [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.
minor comments (2)
  1. [§2] The taxonomy definitions in §2 could include explicit annotation guidelines or example video clips for each sub-dimension to aid reproducibility by other researchers.
  2. [Figure 1] Figure 1 and Table 1 would benefit from clearer captions indicating which models and prompt categories are represented in the example videos.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will revise the paper to provide the requested clarifications and additional details on experimental metrics and dataset construction.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.

    Authors: We acknowledge the referee's point that the presentation of results in §4 requires more explicit quantitative support. While the manuscript includes experimental evaluations of the three models, we agree that specific correlation coefficients, MAE, accuracy metrics, training hyperparameters, validation splits, inter-annotator agreement, and error analysis are insufficiently reported. In the revision, we will expand §4 with these elements, including performance on the human-annotated subset and an analysis of generalization to the full dataset, to better substantiate the alignment claims. revision: yes

  2. Referee: [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.

    Authors: We agree that the manuscript would benefit from greater transparency about the human annotation process. The revised version will describe the human-labeled subset in detail, specifying its size and the stratification across the 12 models, styles, artifacts, and prompts, to demonstrate representativeness. We will also describe the held-out testing strategy, including the use of unseen generators and prompt categories, to support evaluation of generalization. revision: yes
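For illustration only, the sketch below shows one way the requested stratification and held-out-generator split could be specified, assuming a per-video metadata table; the column names, sampling fraction, and model identifiers are invented for the example, not taken from the paper.

```python
# Sketch of a stratified annotation subset plus a held-out-generator split,
# in the spirit of the referee's request. Metadata columns and parameters
# are illustrative assumptions, not the paper's protocol.
import pandas as pd


def stratified_annotation_subset(meta: pd.DataFrame, frac=0.1, seed=0):
    """Sample the same fraction of videos from every (model, prompt_category) cell."""
    return (
        meta.groupby(["model", "prompt_category"], group_keys=False)
            .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )


def held_out_generator_split(meta: pd.DataFrame, held_out_models):
    """Train on some generators, test on generators never seen during training."""
    is_held_out = meta["model"].isin(held_out_models)
    return meta[~is_held_out], meta[is_held_out]


# Toy metadata table standing in for the 60k-video index.
meta = pd.DataFrame({
    "video_id": range(12),
    "model": ["A", "A", "B", "B", "C", "C"] * 2,
    "prompt_category": ["portrait", "landscape"] * 6,
})
subset = stratified_annotation_subset(meta, frac=0.5)
train, test = held_out_generator_split(meta, held_out_models={"C"})
print(len(subset), len(train), len(test))
```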

Circularity Check

0 steps flagged

No circularity: standard supervised learning on human annotations

Full rationale

The paper constructs a dataset of 60k videos from 1,016 prompts and 12 models, annotates a human-labeled subset, and trains three multi-task networks (VAQA-Net, VTag-Net, VGQA-Net) to predict aesthetic and quality attributes. Alignment with human judgments is measured by standard supervised evaluation on held-out labels. No equations, predictions, or uniqueness claims reduce to fitted parameters by construction; no self-citations are invoked as load-bearing mathematical facts; the taxonomy and assessors are defined independently of the final performance numbers. This is ordinary empirical ML benchmarking and remains self-contained against external human annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and supervised-learning paper; no free parameters, axioms, or invented entities are introduced beyond standard neural-network training on human labels.

pith-pipeline@v0.9.0 · 5575 in / 1125 out tokens · 60965 ms · 2026-05-10T15:40:01.646660+00:00 · methodology

