VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3
The pith
VGA-Bench offers a taxonomy-driven benchmark and three neural assessors that align with human judgments for evaluating both aesthetic appeal and technical quality in generated videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGA-Bench is a unified benchmark built on a three-tier taxonomy covering Aesthetic Quality, Aesthetic Tagging, and Generation Quality. The authors design 1,016 prompts, generate over 60,000 videos with 12 models, annotate a human-labeled subset, and train three dedicated neural assessors that align reliably with human judgments while remaining accurate and efficient.
What carries the argument
The three multi-task neural assessors (VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation quality attributes), trained on the human-annotated subset derived from the taxonomy.
Load-bearing premise
The human-labeled subset is representative enough for the three neural assessors to generalize reliably to unseen prompts, models, and video styles.
What would settle it
Applying the trained assessors to a new set of videos generated by models outside the original 12, collecting independent human ratings on the same taxonomy dimensions, and observing whether the predicted scores correlate highly with the new human labels.
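That check reduces to correlating assessor predictions with fresh human ratings. A minimal sketch, assuming per-video scores on one taxonomy dimension are available as plain lists (`preds` and `human` are hypothetical values, not numbers from the paper); SRCC and PLCC are the standard rank and linear agreement statistics used in quality assessment:

```python
from statistics import mean

def pearson(x, y):
    """Pearson linear correlation (PLCC) between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson on ranks (no tie handling)."""
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]
    return pearson(rank(x), rank(y))

preds = [3.1, 4.2, 2.5, 3.8, 4.6, 2.2, 3.3, 4.0]  # hypothetical assessor outputs
human = [3.0, 4.5, 2.8, 3.5, 4.8, 2.0, 3.6, 3.9]  # hypothetical new human labels

print(f"SRCC={spearman(preds, human):.3f}  PLCC={pearson(preds, human):.3f}")
```

High values on videos from generators outside the original 12 would support the generalization claim; a sharp drop relative to the in-distribution numbers would undercut it.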
Original abstract
The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment, particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality in AIGC. It defines a three-tier taxonomy (Aesthetic Quality, Aesthetic Tagging, Generation Quality) with fine-grained sub-dimensions, generates over 60,000 videos from 1,016 diverse prompts using 12 video generation models, human-annotates a subset of the data, and trains three multi-task neural assessors (VAQA-Net for aesthetic quality prediction, VTag-Net for aesthetic tagging, and VGQA-Net for generation quality attributes). The central claim is that extensive experiments show these models achieve reliable alignment with human judgments while providing accuracy and efficiency; the benchmark and models are released publicly.
Significance. If the human alignment claims are substantiated with proper quantitative validation, this work would provide a valuable large-scale resource and automated tools to address the gap between technical fidelity metrics and perceptual/artistic qualities in video generation evaluation. The public release of the 60k-video dataset, prompts, and trained assessors is a clear strength that could support applications in content moderation, model debugging, and generative optimization.
major comments (2)
- [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.
- [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.
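The held-out protocol the second comment asks for can be made concrete: stratify the labeled subset by generator, then withhold entire generators so that "generalization" is tested on models never seen in training. A minimal sketch with illustrative names and counts (nothing here comes from the paper):

```python
import random

# Hypothetical labeled subset: 240 videos spread evenly over 12 generators.
videos = [{"id": i, "model": f"model_{i % 12}"} for i in range(240)]

random.seed(0)
models = sorted({v["model"] for v in videos})
held_out = set(random.sample(models, 2))  # e.g. withhold 2 of 12 generators

# Train on videos from the remaining generators; test only on unseen ones.
train = [v for v in videos if v["model"] not in held_out]
test = [v for v in videos if v["model"] in held_out]

# No generator may appear on both sides of the split.
assert not ({v["model"] for v in train} & {v["model"] for v in test})
print(len(train), len(test), sorted(held_out))
```

The same stratification logic extends to styles, artifact types, and prompt categories; reporting per-stratum counts would directly answer the representativeness question.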
minor comments (2)
- [§2] The taxonomy definitions in §2 could include explicit annotation guidelines or example video clips for each sub-dimension to aid reproducibility by other researchers.
- [Figure 1] Figure 1 and Table 1 would benefit from clearer captions indicating which models and prompt categories are represented in the example videos.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will revise the paper to provide the requested clarifications and additional details on experimental metrics and dataset construction.
Point-by-point responses
-
Referee: [§4] §4 (Experimental evaluation of VAQA-Net, VTag-Net, and VGQA-Net): The central claim that the three neural assessors achieve 'reliable alignment with human judgments' on the full 60k-video dataset is not supported by any reported quantitative metrics (e.g., correlation coefficients, accuracy, or MAE), training details, validation splits, inter-annotator agreement, or error analysis. This is load-bearing because the generalization from the human-annotated subset to the entire dataset and to unseen prompts/models depends on these details.
Authors: We acknowledge the referee's point that the current presentation of results in §4 requires more explicit quantitative support. While the manuscript includes experimental evaluations of the three models, we agree that details such as specific correlation coefficients, MAE, accuracy metrics, training hyperparameters, validation splits, inter-annotator agreement, and error analysis are insufficiently detailed. In the revision, we will expand §4 with these elements, including performance on the human-annotated subset and analysis of generalization to the full dataset, to better substantiate the alignment claims. revision: yes
-
Referee: [§3] §3 (Dataset construction and human annotation): The selection and representativeness of the human-labeled subset (size, stratification across the 12 models, styles, artifacts, and prompts) are not described, nor is any held-out testing strategy (e.g., unseen generators or prompt categories). Without this, the claim that the assessors generalize reliably cannot be evaluated.
Authors: We agree that the manuscript would benefit from greater transparency on the human annotation process. The revised version will include a detailed description of the human-labeled subset, specifying its size, the stratification approach across the 12 models, styles, artifacts, and prompts to demonstrate representativeness. We will also add information on the held-out testing strategy, including use of unseen generators or prompt categories, to support evaluation of generalization. revision: yes
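The inter-annotator agreement the revision promises is typically reported with a chance-corrected statistic such as Cohen's kappa. A minimal sketch for two raters on a binary labeling task; the labels are made up for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)  # agreement expected by chance
    return (po - pe) / (1 - pe)

rater1 = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
rater2 = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]
print(round(cohens_kappa(rater1, rater2), 3))
```

For the multi-rater, ordinal scores a taxonomy like this produces, a weighted kappa or Krippendorff's alpha would be the more natural choice; the computation above shows only the chance-correction idea.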
Circularity Check
No circularity: standard supervised learning on human annotations
full rationale
The paper constructs a dataset of 60k videos from 1,016 prompts and 12 models, annotates a human-labeled subset, and trains three multi-task networks (VAQA-Net, VTag-Net, VGQA-Net) to predict aesthetic and quality attributes. Alignment with human judgments is measured by standard supervised evaluation on held-out labels. No equations, predictions, or uniqueness claims reduce to fitted parameters by construction; no self-citations are invoked as load-bearing mathematical facts; the taxonomy and assessors are defined independently of the final performance numbers. This is ordinary empirical ML benchmarking and remains self-contained against external human annotations.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846.
- [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [3] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
- [4] Blain Brown. Cinematography: Theory and Practice: Image Making for Cinematographers and Directors. Routledge.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- [6] Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. VLP: A survey on vision-language pre-training. Machine Intelligence Research, 20(1):38–56, 2023.
- [7] Maya Deren. Cinematography: The creative use of reality. Daedalus, 89(1):150–167, 1960.
- [8] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, et al. Coarse-to-fine vision-language pre-training with fusion in the backbone. Advances in Neural Information Processing Systems, 35:32942–32956, 2022.
- [9] Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022.
- [10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- [11] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103.
- [12] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
- [13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [14] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- [15] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- [16] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
- [17] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
- [18] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [19] Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. DiT: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3530–3539.
- [20] Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023.
- [21] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
- [22] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
- [23] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. VideoFusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320, 2023.
- [24] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
- [25] Mustafa Yousry Matbouly. Quantifying the unquantifiable: The color of cinematic lighting and its effect on audience's impressions towards the appearance of film characters. Current Psychology, 41(6):3694–3715, 2022.
- [26] Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, and Xin Jin. VADB: A large-scale video aesthetic database with professional and multi-dimensional annotations. arXiv preprint arXiv:2510.25238, 2025.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [28] Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, and Albert Clapés. Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12922–12943, 2023.
- [29] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.
- [30] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [31] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025.
- [32] Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, and Wanli Ouyang. Human-centric foundation models: Perception, generation and agentic modeling. arXiv preprint arXiv:2502.08556, 2025.
- [33] Genmo Team. Mochi 1. https://github.com/genmoai/models, 2024.
- [34] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. 2019.
- [35] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [36] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- [37] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023.
- [38] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025.
- [39] Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? A consecutive events-based benchmark for future long video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13629–13638, 2025.
- [40] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [41] Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui-Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation. Advances in Neural Information Processing Systems, 37:21236–21270, 2024.
- [42] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, 133(4):1879–1893, 2025.
- [43] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [44] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.
discussion (0)