Pith · machine review for the scientific record

arxiv: 2604.10127 · v1 · submitted 2026-04-11 · 💻 cs.CV · cs.AI


VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation


Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: video generation evaluation · aesthetic quality · benchmark dataset · neural assessors · AIGC · multi-task learning · human alignment

The pith

VGA-Bench offers a taxonomy-driven benchmark and three neural assessors that align with human judgments for evaluating both aesthetic appeal and technical quality in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VGA-Bench, a benchmark that assesses generated video along aesthetic quality, aesthetic tagging, and generation quality using a structured three-tier taxonomy. The authors generate over 60,000 videos from 1,016 prompts across 12 models, human-annotate a subset, and use it to train three multi-task networks: VAQA-Net for quality prediction, VTag-Net for tagging, and VGQA-Net for generation attributes. Experiments indicate that these assessors match human opinions closely enough, and run cheaply enough, to be used at scale. This matters because video generation has advanced quickly but still lacks tools that measure perceptual and artistic qualities beyond basic fidelity.

Core claim

VGA-Bench is a unified benchmark built on a three-tier taxonomy covering Aesthetic Quality, Aesthetic Tagging, and Generation Quality. The authors create 1,016 prompts and over 60,000 videos using 12 models, human-label a subset, and train three dedicated neural assessors that achieve reliable alignment with human judgments while remaining efficient enough for automated, large-scale evaluation.

What carries the argument

The three multi-task neural assessors—VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation quality attributes—trained on the human-annotated dataset derived from the taxonomy.
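The paper's assessor internals are not reproduced on this page, but each network follows the shared-backbone, multi-head pattern common to multi-task quality models. The sketch below illustrates that generic pattern only; the projection layer, feature size, and the numbers of sub-dimensions, tags, and generation attributes are illustrative assumptions, not the architecture shown in Figure 4.

```python
# Generic shared-backbone, multi-head video assessor in the spirit of
# VAQA-Net / VTag-Net / VGQA-Net. All dimensions and layer choices here
# are illustrative assumptions, not the paper's specification.
import torch
import torch.nn as nn


class MultiTaskVideoAssessor(nn.Module):
    def __init__(self, feat_dim=768, n_quality_dims=4, n_tags=32, n_gen_attrs=6):
        super().__init__()
        # Stand-in for a pretrained video encoder that pools each clip
        # into a single feature vector before the task-specific heads.
        self.project = nn.Sequential(nn.LazyLinear(feat_dim), nn.GELU())
        self.quality_head = nn.Linear(feat_dim, n_quality_dims)  # score regression
        self.tag_head = nn.Linear(feat_dim, n_tags)              # multi-label logits
        self.gen_head = nn.Linear(feat_dim, n_gen_attrs)         # generation attributes

    def forward(self, clip_features):
        # clip_features: (batch, d) pooled per-clip features from a video backbone.
        h = self.project(clip_features)
        return {
            "quality": self.quality_head(h),   # per-dimension scores (MSE/L1 targets)
            "tags": self.tag_head(h),          # logits; sigmoid/BCEWithLogitsLoss for tagging
            "generation": self.gen_head(h),    # generation-quality attribute predictions
        }


if __name__ == "__main__":
    model = MultiTaskVideoAssessor()
    feats = torch.randn(8, 1024)  # 8 clips, 1024-d pooled features
    outputs = model(feats)
    print({k: tuple(v.shape) for k, v in outputs.items()})
```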

Load-bearing premise

The human-labeled subset is representative enough for the three neural assessors to generalize reliably to unseen prompts, models, and video styles.

What would settle it

Applying the trained assessors to a new set of videos generated by models outside the original 12, collecting independent human ratings on the same taxonomy dimensions, and observing whether the predicted scores correlate highly with the new human labels.
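A minimal sketch of that check, assuming the new videos have already been scored by a trained assessor and independently rated by human annotators; the function, array names, and toy numbers are illustrative, not from the paper.

```python
# Sketch of the proposed falsification test: score videos from held-out
# generators, collect independent human ratings on the same taxonomy
# dimension, and check linear and rank agreement plus absolute error.
import numpy as np
from scipy.stats import pearsonr, spearmanr


def alignment_report(predicted_scores, human_mos):
    """Compare assessor predictions with human mean-opinion scores."""
    predicted_scores = np.asarray(predicted_scores, dtype=float)
    human_mos = np.asarray(human_mos, dtype=float)

    plcc, _ = pearsonr(predicted_scores, human_mos)   # linear correlation
    srcc, _ = spearmanr(predicted_scores, human_mos)  # rank correlation
    mae = float(np.mean(np.abs(predicted_scores - human_mos)))
    return {"PLCC": plcc, "SRCC": srcc, "MAE": mae}


# Toy numbers; real use would load per-video scores for videos generated
# by models outside the original 12.
preds = [3.1, 4.0, 2.2, 4.6, 3.8]
mos = [3.0, 4.2, 2.5, 4.4, 3.9]
print(alignment_report(preds, mos))
```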

Figures

Figures reproduced from arXiv: 2604.10127 by Bao Peng, Dandan Zheng, Heng Huang, Huaye Wang, Jingdong Chen, Jun Zhou, Longteng Jiang, Qianqian Qiao, Xin Jin, Yihang Bo.

Figure 1: Overview of VGA-Bench, a unified benchmark and multi-model framework for video aesthetic and generation quality evaluation.
Figure 3: Examples of human annotations for the three core dimensions.
Figure 4: Architecture of (1) VAQA-Net, (2) VTag-Net, and (3) VGQA-Net.
Figure 5: Comparison of generated videos from different models.
Figure 6: Radar chart comparing the performance of various video generation models across three evaluation dimensions.
Original abstract

The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality in AIGC. It defines a three-tier taxonomy (Aesthetic Quality, Aesthetic Tagging, Generation Quality) with fine-grained sub-dimensions, generates over 60,000 videos from 1,016 diverse prompts using 12 video generation models, human-annotates a subset of the data, and trains three multi-task neural assessors (VAQA-Net for aesthetic quality prediction, VTag-Net for aesthetic tagging, and VGQA-Net for generation quality attributes). The central claim is that extensive experiments show these models achieve reliable alignment with human judgments while providing accuracy and efficiency; the benchmark and models are released publicly.

Significance. If the human alignment claims are substantiated with proper quantitative validation, this work would provide a valuable large-scale resource and automated tools to address the gap between technical fidelity metrics and perceptual/artistic qualities in video generation evaluation. The public release of the 60k-video dataset, prompts, and trained assessors is a clear strength that could support applications in content moderation, model debugging, and generative optimization.

major comments (2)
  1. [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.
  2. [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.
minor comments (2)
  1. [§2] The taxonomy definitions in §2 could include explicit annotation guidelines or example video clips for each sub-dimension to aid reproducibility by other researchers.
  2. [Figure 1] Figure 1 and Table 1 would benefit from clearer captions indicating which models and prompt categories are represented in the example videos.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will revise the paper to provide the requested clarifications and additional details on experimental metrics and dataset construction.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.

    Authors: We acknowledge the referee's point that the presentation of results in §4 requires more explicit quantitative support. While the manuscript includes experimental evaluations of the three models, we agree that specific correlation coefficients, MAE, accuracy metrics, training hyperparameters, validation splits, inter-annotator agreement, and error analysis are insufficiently reported. In the revision, we will expand §4 with these elements, including performance on the human-annotated subset and an analysis of generalization to the full dataset, to better substantiate the alignment claims. revision: yes

  2. Referee: [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.

    Authors: We agree that the manuscript would benefit from greater transparency about the human annotation process. The revised version will describe the human-labeled subset in detail, specifying its size and the stratification across the 12 models, styles, artifacts, and prompts, to demonstrate representativeness. We will also describe the held-out testing strategy, including the use of unseen generators and prompt categories, to support evaluation of generalization. revision: yes
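For illustration only, the sketch below shows one way the requested stratification and held-out-generator split could be specified, assuming a per-video metadata table; the column names, sampling fraction, and model identifiers are invented for the example, not taken from the paper.

```python
# Sketch of a stratified annotation subset plus a held-out-generator split,
# in the spirit of the referee's request. Metadata columns and parameters
# are illustrative assumptions, not the paper's protocol.
import pandas as pd


def stratified_annotation_subset(meta: pd.DataFrame, frac=0.1, seed=0):
    """Sample the same fraction of videos from every (model, prompt_category) cell."""
    return (
        meta.groupby(["model", "prompt_category"], group_keys=False)
            .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )


def held_out_generator_split(meta: pd.DataFrame, held_out_models):
    """Train on some generators, test on generators never seen during training."""
    is_held_out = meta["model"].isin(held_out_models)
    return meta[~is_held_out], meta[is_held_out]


# Toy metadata table standing in for the 60k-video index.
meta = pd.DataFrame({
    "video_id": range(12),
    "model": ["A", "A", "B", "B", "C", "C"] * 2,
    "prompt_category": ["portrait", "landscape"] * 6,
})
subset = stratified_annotation_subset(meta, frac=0.5)
train, test = held_out_generator_split(meta, held_out_models={"C"})
print(len(subset), len(train), len(test))
```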

Circularity Check

0 steps flagged

No circularity: standard supervised learning on human annotations

Full rationale

The paper constructs a dataset of 60k videos from 1,016 prompts and 12 models, annotates a human-labeled subset, and trains three multi-task networks (VAQA-Net, VTag-Net, VGQA-Net) to predict aesthetic and quality attributes. Alignment with human judgments is measured by standard supervised evaluation on held-out labels. No equations, predictions, or uniqueness claims reduce to fitted parameters by construction; no self-citations are invoked as load-bearing mathematical facts; the taxonomy and assessors are defined independently of the final performance numbers. This is ordinary empirical ML benchmarking and remains self-contained against external human annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and supervised-learning paper; no free parameters, axioms, or invented entities are introduced beyond standard neural-network training on human labels.

pith-pipeline@v0.9.0 · 5575 in / 1125 out tokens · 60965 ms · 2026-05-10T15:40:01.646660+00:00 · methodology

