World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
Pith reviewed 2026-05-08 04:23 UTC · model grok-4.3
The pith
Reinforcement learning with feedback from 3D models enforces geometric consistency in text-to-video generation without changing the base architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
World-R1 shows that alignment with 3D constraints can be achieved post-training through reinforcement learning that receives direct feedback from existing 3D foundation models and vision-language models. By optimizing via Flow-GRPO on a dedicated text-only world-simulation dataset and applying a periodic decoupled training strategy, the method raises 3D consistency scores while leaving the underlying video generation architecture unchanged.
What carries the argument
Flow-GRPO reinforcement learning that converts feedback from pre-trained 3D foundation models and vision-language models into policy updates for video generation.
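Flow-GRPO itself is specified in the cited work rather than in this summary; the group-relative mechanism it builds on can be sketched as follows: each prompt yields a group of sampled videos, the external 3D and vision-language models score them, and the scores are standardized within the group into advantages that weight a clipped policy-gradient loss. The tensor shapes, function names, and clipping constant below are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize reward-model scores within
    each prompt's group of sampled videos.

    rewards: (num_prompts, group_size) scalar scores assembled from the
    external 3D foundation model and VLM feedback.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_pg_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                    advantages: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on the sampler's trajectory log-probabilities;
    only the generator's weights receive gradients, its architecture is untouched.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Standardizing within each group removes the need for a learned value baseline, which is one reason this style of update can wrap an existing generator without architectural changes.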
If this is right
- Video generation pipelines can incorporate 3D awareness as a post-training fine-tuning step rather than during initial training.
- World-simulation tasks become feasible at larger scale because no new model architecture is required.
- Dynamic scene elements remain fluid because the training schedule explicitly separates geometric and motion objectives.
- Existing high-quality video models can be upgraded for consistency without retraining from scratch.
Where Pith is reading between the lines
- The same feedback loop could be applied to enforce other physical priors such as lighting or material properties once suitable reward models exist.
- If the quality of the 3D feedback models continues to improve, the ceiling on achievable consistency rises without any change to the video generator itself.
- Periodic decoupling may generalize to other trade-offs in generative models where one property must be preserved while another is optimized.
Load-bearing premise
Signals from pre-trained 3D foundation models and vision-language models provide reliable, unbiased measures of structural coherence that improve video output when used as rewards.
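Whether that premise holds is an empirical question; one minimal way to probe it, sketched below as an assumption rather than anything reported in the paper, is to rank-correlate the reward-model scores with an independent geometric error measured on the same generated videos.

```python
from scipy.stats import spearmanr

def reward_vs_geometry_correlation(rewards, geometric_errors):
    """Rank correlation between reward-model scores and an independent
    geometric error (e.g. the reprojection error sketched further below)
    computed on the same videos. A strongly negative rho suggests the
    reward tracks true 3D structure rather than an unrelated proxy.
    """
    rho, p_value = spearmanr(rewards, geometric_errors)
    return rho, p_value

# Example with made-up numbers: higher reward should coincide with lower error.
print(reward_vs_geometry_correlation([0.9, 0.7, 0.4, 0.2], [0.05, 0.08, 0.20, 0.35]))
```

If such correlations turned out weak or unstable, the premise above, and the optimization built on it, would be correspondingly weaker.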
What would settle it
A side-by-side comparison on a held-out set of prompts: if videos produced after Flow-GRPO optimization show no gain, or a loss, in standard 3D consistency metrics such as multi-view geometric error or object trajectory stability, the core claim fails; a clear and consistent gain supports it.
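Those metrics are standard but not defined in the abstract. Below is a minimal sketch of one plausible instantiation, a cross-frame depth-reprojection error, assuming per-frame depth maps and relative camera poses have already been estimated (for example by an off-the-shelf depth or 3D foundation model). It is illustrative, not the paper's evaluation code.

```python
import numpy as np

def reprojection_depth_error(depth_i, depth_j, K, R, t):
    """Cross-frame depth consistency: warp frame i's depth into frame j using
    the relative pose (R, t) and compare against frame j's own depth map.

    depth_i, depth_j: (H, W) per-frame depth maps for two generated frames.
    K: (3, 3) intrinsics; R: (3, 3) rotation and t: (3,) translation taking
    frame-i coordinates into frame-j coordinates.
    Returns mean absolute relative depth error over pixels that land in frame j.
    """
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project frame-i pixels to 3D and move them into frame j's coordinates.
    pts_i = (np.linalg.inv(K) @ pix) * depth_i.reshape(1, -1)
    pts_j = R @ pts_i + t.reshape(3, 1)

    # Project into frame j; the warp predicts a depth z at each landing pixel.
    proj = K @ pts_j
    z = proj[2]
    z_safe = np.where(z > 1e-6, z, 1.0)
    uj = np.round(proj[0] / z_safe).astype(int)
    vj = np.round(proj[1] / z_safe).astype(int)
    valid = (z > 1e-6) & (uj >= 0) & (uj < W) & (vj >= 0) & (vj < H)
    if not np.any(valid):
        return float("nan")

    observed = depth_j[vj[valid], uj[valid]]
    predicted = z[valid]
    return float(np.mean(np.abs(predicted - observed) / np.maximum(observed, 1e-6)))
```

Object trajectory stability could be checked in a similar spirit, by tracking points across frames and measuring how rigidly static structures move under the recovered camera motion.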
read the original abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes World-R1, a framework for text-to-video generation that reinforces 3D constraints using reinforcement learning. Specifically, it introduces Flow-GRPO to optimize a video foundation model with feedback from pre-trained 3D foundation models and vision-language models, supported by a new pure text dataset for world simulation and a periodic decoupled training strategy to maintain both geometric consistency and dynamic fluidity. The central claim is that this method significantly improves 3D consistency without altering the underlying model architecture or compromising visual quality.
Significance. Should the quantitative improvements be demonstrated rigorously, this paper would offer a valuable contribution to scalable world simulation by providing an efficient post-hoc alignment technique that avoids the computational overhead of architectural changes. The use of external model feedback for RL in video generation is a promising direction, and the specialized dataset could serve as a useful resource for the community if released.
major comments (3)
- Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.
- Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.
- Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.
minor comments (2)
- Abstract: The term 'pure text dataset' could be clarified regarding its construction, size, and how it differs from existing text corpora used in video training.
- Overall: Some notation around Flow-GRPO could benefit from a formal definition or pseudocode to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodological details.
read point-by-point responses
-
Referee: Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.
Authors: The abstract provides a high-level summary of our contributions, while the full quantitative evaluations—including specific metrics on 3D consistency, baseline comparisons, and ablation studies—are detailed in the Experiments section. To make the abstract more self-contained and address this concern, we will revise it to include representative numerical results and references to the supporting analyses in the main text. revision: yes
-
Referee: Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.
Authors: The reward signals draw from established pre-trained 3D foundation models and VLMs whose geometric capabilities have been validated in prior work. We agree that direct correlation analysis on videos generated by our model would further substantiate the approach. In the revised manuscript, we will add an analysis section validating these signals against geometric consistency measures such as multi-view reprojection error and point-cloud consistency. revision: yes
-
Referee: Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.
Authors: The periodic decoupled training strategy is outlined in the Training Strategy section to maintain both geometric consistency and dynamic fluidity. We will expand this section in the revision with explicit details on the decoupling mechanism, reward formulation, and additional experiments demonstrating that the strategy mitigates reward hacking while preserving scene dynamics. revision: yes
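The decoupling mechanism is not specified in the abstract or in the rebuttal above. One reading of "periodic decoupled training" is an alternating schedule in which only one reward family drives updates at a time; the sketch below is that reading, with the period and the 50/50 split chosen purely for illustration.

```python
def active_objective(step: int, period: int = 1000, geo_fraction: float = 0.5) -> str:
    """Hypothetical periodic schedule: the first part of every period optimizes
    geometric consistency, the remainder optimizes dynamic fluidity, so neither
    objective dominates for long."""
    return "geometry" if (step % period) < geo_fraction * period else "dynamics"

def step_reward(step: int, geometry_reward: float, dynamics_reward: float) -> float:
    """Route the reward actually used for the policy update in the current phase."""
    return geometry_reward if active_objective(step) == "geometry" else dynamics_reward
```

Whether the paper's actual mechanism alternates rewards, data, or loss terms is exactly the detail the referee asks for; the sketch only shows why a period length and a split ratio are natural free parameters to report.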
Circularity Check
No circularity: external pre-trained models supply independent feedback signals
full rationale
The paper's core derivation uses reinforcement learning (Flow-GRPO) driven by feedback from separately pre-trained 3D foundation models and VLMs, followed by independent evaluations of 3D consistency. No equations or definitions reduce the claimed improvement to the input rewards by construction, no self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. The approach is self-contained because the reward models are external and the final claims rest on separate evaluation metrics rather than tautological re-use of the same signals.