LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
Pith reviewed 2026-05-08 16:17 UTC · model grok-4.3
The pith
Perceptual quality alone cannot determine whether generated dynamics are physically plausible or temporally coherent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a holistic metric must jointly predict the four dimensions of Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency, and must localize physical anomaly timestamps, because perceptual quality by itself is insufficient to determine whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions.
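The four-dimension-plus-localization task above implies a concrete shape for what a submitted metric must emit per video. A minimal sketch of such a record follows; the field names and score scale are illustrative assumptions, not the challenge's actual submission schema.

```python
from dataclasses import dataclass, field

@dataclass
class VideoAssessment:
    """Hypothetical per-video prediction for the four-dimension task."""
    video_id: str
    video_quality: float          # perceptual quality, e.g. on a 1-5 scale
    physical_realism: float       # plausibility of the depicted dynamics
    condition_alignment: float    # agreement with the input condition
    temporal_consistency: float   # coherence across frames
    # (start, end) intervals, in seconds, where physical anomalies occur
    anomaly_timestamps: list[tuple[float, float]] = field(default_factory=list)
```

The point of the joint structure is that a metric cannot satisfy the task by predicting perceptual quality alone: the realism score and the anomaly intervals are separate outputs that must be learned and evaluated in their own right.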
What carries the argument
The four-dimensional prediction task combined with timestamp localization, scored by a composite protocol of TimeStamp_IOU for anomalies and SRCC/PLCC for the dimension scores.
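The composite protocol can be sketched in a few lines. The interval IoU and the SRCC/PLCC terms follow their standard definitions; the 50/50 weighting and the averaging of SRCC and PLCC are placeholder assumptions, since the report does not state the exact combination (scipy is assumed available).

```python
from scipy.stats import spearmanr, pearsonr

def timestamp_iou(pred, gt):
    """IoU of two (start, end) anomaly intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def composite_score(pred_scores, human_scores, pred_intervals, gt_intervals,
                    w_loc=0.5):
    """Placeholder composite: w_loc weights localization vs. score prediction."""
    srcc, _ = spearmanr(pred_scores, human_scores)   # rank correlation
    plcc, _ = pearsonr(pred_scores, human_scores)    # linear correlation
    mean_iou = sum(timestamp_iou(p, g)
                   for p, g in zip(pred_intervals, gt_intervals)) / len(gt_intervals)
    return w_loc * mean_iou + (1 - w_loc) * (srcc + plcc) / 2
```

A metric that nails the dimension ratings but misses the anomaly moments (or vice versa) is penalized under any such combination, which is the design intent of scoring both jointly.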
If this is right
- Metrics optimized on the benchmark will better identify generations that look realistic yet break physical laws.
- Anomaly localization will let developers target fixes at specific moments rather than retraining entire models.
- The three-track structure will expose differences in physical consistency between text-to-2D, image-to-4D, and video-to-4D generation.
- Composite scoring will reward methods that perform well on both overall ratings and precise timing of flaws.
Where Pith is reading between the lines
- The same multi-dimensional approach could be adapted to evaluate consistency in other generative outputs such as simulated 3D scenes or robotic trajectories.
- Successful metrics from the challenge may reduce dependence on repeated human annotation by serving as training signals for automated physical checkers.
- Widespread adoption would likely push world-model developers to incorporate explicit physics constraints during training rather than relying on post-hoc correction.
Load-bearing premise
Trained human annotators can produce reliable, unbiased labels for physical realism and anomaly timestamps across diverse physics scenarios without systematic errors from subjective judgment.
What would settle it
A controlled study in which a perceptual-only metric achieves high correlation with human physical-realism scores and anomaly timestamps across the same dataset would show the four-dimensional requirement is unnecessary.
Original abstract
This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Beyond that, participants must also localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports on the LoViF 2026 PhyScore challenge for holistic quality assessment of videos generated by 4D world models. It argues that perceptual quality metrics alone are insufficient to evaluate physical plausibility, temporal coherence, and consistency with input conditions. Participants must develop metrics that jointly predict four dimensions (Video Quality, Physical Realism, Condition-Video Alignment, Temporal Consistency) and localize physical anomaly timestamps. The benchmark consists of 1,554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks spanning 26 physics-relevant categories. Labels are produced by trained human annotators with an automated QC pass, and evaluation uses a composite protocol of TimeStamp_IOU for localization together with SRCC/PLCC for score prediction. The report summarizes the challenge design and provides method-level insights from submitted solutions.
Significance. If the human annotations reliably capture physical properties rather than perceptual cues, this benchmark could meaningfully advance evaluation standards in generative video and world modeling by addressing the gap between visual quality and physical realism. The sizable dataset (1,554 videos), coverage of 26 physics categories across three distinct generation tracks, and dual emphasis on score prediction plus fine-grained anomaly localization constitute concrete strengths that could support reproducible progress in the field.
Major comments (1)
- [Annotation and Quality Control] The annotation protocol (trained human annotators plus automated QC for the full set of 1,554 videos) provides no quantitative validation of label reliability. No inter-annotator agreement statistics, no correlation with domain-expert physicists, and no comparison against simulation-derived ground truth for realism scores or anomaly timestamps are reported. This is load-bearing for the central claim that perceptual quality is insufficient, because without such validation the new dimensions risk being redundant with existing video-quality metrics if annotators primarily respond to visual smoothness.
Minor comments (2)
- [Evaluation Protocol] The composite evaluation protocol is described only at a high level; the exact weighting between TimeStamp_IOU and the SRCC/PLCC terms, including any normalization, should be stated explicitly to enable direct reproduction by future participants.
- [Results and Insights] The abstract states that the report provides 'method-level insights from submitted solutions,' yet the manuscript text does not include concrete examples of submitted approaches, their performance numbers, or ablation analyses; adding a concise summary table of top entries would strengthen the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, particularly on the need for quantitative validation of the human annotations. This is a critical aspect for establishing the benchmark's reliability, and we address the concern directly below.
Point-by-point responses
- Referee: [Annotation and Quality Control] The annotation protocol (trained human annotators plus automated QC for the full set of 1,554 videos) provides no quantitative validation of label reliability. No inter-annotator agreement statistics, no correlation with domain-expert physicists, and no comparison against simulation-derived ground truth for realism scores or anomaly timestamps are reported. This is load-bearing for the central claim that perceptual quality is insufficient, because without such validation the new dimensions risk being redundant with existing video-quality metrics if annotators primarily respond to visual smoothness.
Authors: We agree that the absence of quantitative validation metrics in the current manuscript is a limitation. In the revised version, we will add inter-annotator agreement statistics (such as Fleiss' kappa or intraclass correlation coefficients) computed from the multiple annotations per video where available. However, correlations with domain-expert physicists and comparisons to simulation-derived ground truth were not collected as part of the challenge design, which relied on trained human annotators for scalability across 1,554 videos and 26 physics categories. We will explicitly discuss this as a limitation in the revised manuscript, including its potential impact on distinguishing physical realism from perceptual cues, and outline it as future work. The automated QC pass and training protocol were intended to mitigate subjectivity, but we acknowledge that additional validation would further support the claim that the four dimensions capture aspects beyond standard video quality metrics.
Revision status: partial. The revision will not include:
- Providing correlations with domain-expert physicists, as no such annotations were collected
- Providing comparisons against simulation-derived ground truth for realism scores or anomaly timestamps, as no corresponding simulations exist for the generated videos
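The agreement statistic the rebuttal promises can be computed without specialized tooling. Below is a minimal sketch of Fleiss' kappa for categorical ratings; discretizing continuous realism scores into categories would be a further design choice the report does not specify (statsmodels also ships an implementation, but the formula is short enough to state directly).

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning
    category j to video i; assumes an equal number of raters per video."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement P_e from marginal category proportions
    n_cats = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Kappa of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement; reporting it per dimension would directly address the referee's redundancy concern.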
Circularity Check
No circularity in descriptive challenge report
Full rationale
The paper is a descriptive report on the LoViF 2026 PhyScore challenge. It outlines the motivation, dataset construction, annotation protocol, and evaluation metrics without any mathematical derivations, equations, fitted parameters, model predictions, or self-referential reductions. No load-bearing steps reduce by construction to inputs, self-citations, or ansatzes. The central motivation (perceptual quality being insufficient) is presented as the rationale for creating the benchmark rather than derived from any prior fitted quantities or uniqueness theorems. This is a standard non-finding for challenge papers that contain no analytical derivations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human annotations after training, plus automated QC, produce reliable labels for physical realism and anomaly timestamps.