pith. machine review for the scientific record.

arxiv: 2605.05187 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords holistic quality assessment · world models · video generation · physical realism · temporal consistency · anomaly localization · benchmark challenge · 4D generation

The pith

Perceptual quality alone cannot judge whether generated dynamics are physically plausible or temporally coherent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up the LoViF 2026 PhyScore challenge because standard perceptual metrics miss whether AI-generated videos follow real physics, maintain temporal consistency, and match their input conditions. The challenge requires new metrics that jointly predict four dimensions (Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency) while also identifying the exact timestamps of physical anomalies. This matters for world models because single-score evaluations let through dynamics that look fine but violate basic physical rules, across both 2D and 4D generation. The benchmark supplies 1,554 videos from seven models in 26 physics-relevant categories, with labels from trained human annotators plus an automated quality-control pass to create reliable targets.

Core claim

The central claim is that a holistic metric must jointly predict the four dimensions of Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency, and must localize physical anomaly timestamps, because perceptual quality by itself is insufficient to determine whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions.
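To make the task's output shape concrete, below is a minimal sketch of the per-video prediction such a metric would emit. The field names, score semantics, and interval representation are assumptions; the report does not specify a submission schema.

    from dataclasses import dataclass, field

    @dataclass
    class PhyScorePrediction:
        """Hypothetical per-video output for the PhyScore task (names assumed)."""
        video_quality: float         # perceptual quality score
        physical_realism: float      # adherence to physical law
        condition_alignment: float   # consistency with the input prompt/image/video
        temporal_consistency: float  # coherence of dynamics over time
        # (start_s, end_s) timestamps of localized physical anomalies
        anomaly_intervals: list[tuple[float, float]] = field(default_factory=list)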

What carries the argument

The four-dimensional prediction task combined with timestamp localization, scored by a composite protocol of TimeStamp_IOU for anomalies and SRCC/PLCC for the dimension scores.
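The report names the protocol's components (TimeStamp_IOU, SRCC, PLCC) but not their exact weighting or normalization. The following is a hedged sketch of one plausible composite; the equal blending and the one-to-one pairing of predicted and ground-truth intervals are assumptions, not the challenge's published formula.

    import numpy as np
    from scipy.stats import spearmanr, pearsonr

    def timestamp_iou(pred, gt):
        """Temporal IoU of two (start, end) anomaly intervals, in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    def composite_score(pred, human, pred_ivals, gt_ivals, w=0.5):
        """Assumed composite: mean SRCC/PLCC across the four dimensions,
        blended with mean TimeStamp_IOU over paired anomaly intervals.
        pred, human: (n_videos, 4) score arrays."""
        srcc = np.mean([spearmanr(pred[:, d], human[:, d])[0] for d in range(4)])
        plcc = np.mean([pearsonr(pred[:, d], human[:, d])[0] for d in range(4)])
        iou = np.mean([timestamp_iou(p, g) for p, g in zip(pred_ivals, gt_ivals)])
        return w * 0.5 * (srcc + plcc) + (1.0 - w) * iou

Any published protocol would also have to handle videos with no annotated anomaly and many-to-many interval matching, which this sketch sidesteps.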

If this is right

  • Metrics optimized on the benchmark will better identify generations that look realistic yet break physical laws.
  • Anomaly localization will let developers target fixes at specific moments rather than retraining entire models.
  • The three-track structure will expose differences in physical consistency between text-to-2D, image-to-4D, and video-to-4D generation.
  • Composite scoring will reward methods that perform well on both overall ratings and precise timing of flaws.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same multi-dimensional approach could be adapted to evaluate consistency in other generative outputs such as simulated 3D scenes or robotic trajectories.
  • Successful metrics from the challenge may reduce dependence on repeated human annotation by serving as training signals for automated physical checkers.
  • Widespread adoption would likely push world-model developers to incorporate explicit physics constraints during training rather than relying on post-hoc correction.

Load-bearing premise

Trained human annotators can produce reliable, unbiased labels for physical realism and anomaly timestamps across diverse physics scenarios without systematic errors from subjective judgment.

What would settle it

A controlled study in which a perceptual-only metric achieves high correlation with human physical-realism scores and anomaly timestamps across the same dataset would show the four-dimensional requirement is unnecessary.

Figures

Figures reproduced from arXiv: 2605.05187 by Chen Gao, Dubing Chen, Fang Liu, Fengbin Guan, Guangtao Zhai, Haoran Li, Huan Zheng, Huiyu Duan, Jing Liu, Kang Fu, Licheng Jiao, Lingling Li, Manabu Tsukada, Manxi Sun, Qiang Hu, Sijing Wu, Tianyi Yan, Wei Luo, Xin Jin, Xin Li, Xiongkuo Min, Yiting Lu, Yi Wen, Yiwen Ren, Yong Li, Yucheng Zhou, Yunhao Li, Yun Li, Zhenglin Du, Zhengyang Li, Zhibo Chen, Zhilong Song, Ziang Xiao, Zixuan Guo, Ziyang Chen.

Figure 1: Thumbnails and multidimensional scores of 10 generated videos in the training set, example 1.
Figure 2: Thumbnails and multidimensional scores of 10 generated videos in the training set, example 2.
Figure 3: Thumbnails and multidimensional scores of 10 generated videos in the training set, example 3.
Figure 4: Overview of team SJTU-MM's proposed method.
Figure 5: Overview of team INHI's proposed method. …anomaly probability classification, and timestamp regression. The head outputs four quality scores, an anomaly probability, and anomaly start/end timestamps. Training uses only official challenge data with no extra samples. The total loss is L = L_score + 0.35·L_prob + 0.40·L_ts, where L_score is a Huber loss on the quality scores (VideoQuality weighted by …).
Figure 6: Overview of the WDL two-stage pipeline and multi-task outputs.
Figure 7: Overview of team DYTH's proposed method. …lected settings on all official training data. No extra challenge data is used; external resources are limited to pretrained CLIP weights. The factsheet reports a public test score of 0.484 and a local cross-validation score of 0.4936 for the type-separated setting. Inference uses per-type models and an anomaly branch with post-hoc temporal cor….
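The only concrete training detail recoverable from these captions is team INHI's multi-task objective, L = L_score + 0.35·L_prob + 0.40·L_ts. A minimal PyTorch sketch of that combination follows; the weights and the Huber choice for L_score come from the Figure 5 caption, while the BCE and L1 forms of the other two terms and all tensor shapes are assumptions (the caption's per-dimension weighting of VideoQuality is elided in the extract).

    import torch
    import torch.nn.functional as F

    def inhi_style_loss(pred_scores, gt_scores, pred_prob, gt_prob, pred_ts, gt_ts):
        """Sketch of team INHI's reported total loss (weights per the Figure 5 caption).
        pred_scores, gt_scores: (B, 4) quality scores
        pred_prob, gt_prob:     (B,) anomaly probabilities in [0, 1]
        pred_ts, gt_ts:         (B, 2) anomaly start/end timestamps (assumed shapes)."""
        l_score = F.huber_loss(pred_scores, gt_scores)       # Huber on the four scores
        l_prob = F.binary_cross_entropy(pred_prob, gt_prob)  # assumed form of L_prob
        l_ts = F.l1_loss(pred_ts, gt_ts)                     # assumed form of L_ts
        return l_score + 0.35 * l_prob + 0.40 * l_ts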
read the original abstract

This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Beyond that, participants also need to localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript reports on the LoViF 2026 PhyScore challenge for holistic quality assessment of videos generated by 4D world models. It argues that perceptual quality metrics alone are insufficient to evaluate physical plausibility, temporal coherence, and consistency with input conditions. Participants must develop metrics that jointly predict four dimensions (Video Quality, Physical Realism, Condition-Video Alignment, Temporal Consistency) and localize physical anomaly timestamps. The benchmark consists of 1,554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks spanning 26 physics-relevant categories. Labels are produced by trained human annotators with an automated QC pass, and evaluation uses a composite protocol of TimeStamp_IOU for localization together with SRCC/PLCC for score prediction. The report summarizes the challenge design and provides method-level insights from submitted solutions.

Significance. If the human annotations reliably capture physical properties rather than perceptual cues, this benchmark could meaningfully advance evaluation standards in generative video and world modeling by addressing the gap between visual quality and physical realism. The sizable dataset (1,554 videos), coverage of 26 physics categories across three distinct generation tracks, and dual emphasis on score prediction plus fine-grained anomaly localization constitute concrete strengths that could support reproducible progress in the field.

major comments (1)
  1. [Annotation and Quality Control] The annotation protocol (trained human annotators plus automated QC for the full set of 1,554 videos) provides no quantitative validation of label reliability. No inter-annotator agreement statistics, no correlation with domain-expert physicists, and no comparison against simulation-derived ground truth for realism scores or anomaly timestamps are reported. This is load-bearing for the central claim that perceptual quality is insufficient, because without such validation the new dimensions risk being redundant with existing video-quality metrics if annotators primarily respond to visual smoothness.
minor comments (2)
  1. [Evaluation Protocol] The composite evaluation protocol is described only at a high level; the exact weighting between TimeStamp_IOU and the SRCC/PLCC terms, including any normalization, should be stated explicitly to enable direct reproduction by future participants.
  2. [Results and Insights] The abstract states that the report provides 'method-level insights from submitted solutions,' yet the manuscript text does not include concrete examples of submitted approaches, their performance numbers, or ablation analyses; adding a concise summary table of top entries would strengthen the contribution.

Simulated Author's Rebuttal

1 response · 2 unresolved

We thank the referee for the constructive feedback, particularly on the need for quantitative validation of the human annotations. This is a critical aspect for establishing the benchmark's reliability, and we address the concern directly below.

read point-by-point responses
  1. Referee: [Annotation and Quality Control] The annotation protocol (trained human annotators plus automated QC for the full set of 1,554 videos) provides no quantitative validation of label reliability. No inter-annotator agreement statistics, no correlation with domain-expert physicists, and no comparison against simulation-derived ground truth for realism scores or anomaly timestamps are reported. This is load-bearing for the central claim that perceptual quality is insufficient, because without such validation the new dimensions risk being redundant with existing video-quality metrics if annotators primarily respond to visual smoothness.

    Authors: We agree that the absence of quantitative validation metrics in the current manuscript is a limitation. In the revised version, we will add inter-annotator agreement statistics (such as Fleiss' kappa or intraclass correlation coefficients) computed from the multiple annotations per video where available. However, correlations with domain-expert physicists and comparisons to simulation-derived ground truth were not collected as part of the challenge design, which relied on trained human annotators for scalability across 1,554 videos and 26 physics categories. We will explicitly discuss this as a limitation in the revised manuscript, including its potential impact on distinguishing physical realism from perceptual cues, and outline it as future work. The automated QC pass and training protocol were intended to mitigate subjectivity, but we acknowledge that additional validation would further support the claim that the four dimensions capture aspects beyond standard video quality metrics. revision: partial
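If those agreement statistics are added, the computation itself is standard. Below is a self-contained sketch of Fleiss' kappa over categorical labels; the rating-matrix layout (each of the 1,554 videos rated by a fixed number of annotators into discrete realism levels) is an assumption, while the formula is Fleiss' standard one.

    import numpy as np

    def fleiss_kappa(counts):
        """Fleiss' kappa for an (n_items, n_categories) matrix of rating counts,
        assuming every item is rated by the same number of annotators."""
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1)[0]                                   # raters per item
        p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # per-item agreement
        p_bar = p_i.mean()                                          # observed agreement
        p_j = counts.sum(axis=0) / counts.sum()                     # category marginals
        p_e = np.square(p_j).sum()                                  # chance agreement
        return (p_bar - p_e) / (1.0 - p_e)

    # e.g. kappa = fleiss_kappa(ratings)  # ratings: (1554, 5) count matrix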

standing simulated objections not resolved
  • Providing correlations with domain-expert physicists, as no such annotations were collected
  • Providing comparisons against simulation-derived ground truth for realism scores or anomaly timestamps, as no corresponding simulations exist for the generated videos

Circularity Check

0 steps flagged

No circularity in descriptive challenge report

full rationale

The paper is a descriptive report on the LoViF 2026 PhyScore challenge. It outlines the motivation, dataset construction, annotation protocol, and evaluation metrics without any mathematical derivations, equations, fitted parameters, model predictions, or self-referential reductions. No load-bearing steps reduce by construction to inputs, self-citations, or ansatzes. The central motivation (perceptual quality being insufficient) is presented as the rationale for creating the benchmark rather than derived from any prior fitted quantities or uniqueness theorems. This is a standard non-finding for challenge papers that contain no analytical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The report rests on the assumption that human judgment can serve as ground truth for physical properties without providing independent verification mechanisms.

axioms (1)
  • domain assumption: Human annotations after training plus automated QC produce reliable labels for physical realism and anomaly timestamps
    Stated in the abstract as the source of scores and timestamps.

pith-pipeline@v0.9.0 · 5667 in / 1208 out tokens · 52435 ms · 2026-05-08T16:17:17.706996+00:00 · methodology

