pith. machine review for the scientific record.

arxiv: 2604.24764 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · 3D consistency · reinforcement learning · world simulation · geometric constraints · Flow-GRPO · foundation models

The pith

Reinforcement learning with feedback from 3D models enforces geometric consistency in text-to-video generation without changing the base architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video foundation models create visually strong outputs but often violate basic 3D structure across frames. World-R1 addresses this by treating 3D coherence as a reinforcement learning objective rather than an architectural constraint. It supplies a pure-text dataset for world simulation and uses Flow-GRPO to draw reward signals from pre-trained 3D foundation models and vision-language models. A periodic decoupled training schedule keeps rigid geometry and scene motion in balance. The result is measurably higher structural consistency while the original visual quality of the foundation model remains unchanged.
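
A minimal sketch of that loop as Pith reads it, assuming a GRPO-style group-relative update and one plausible reading of the periodic decoupled schedule (alternating which external signal drives each phase). Every name below (generate_videos, geometry_reward, vlm_reward, policy_update) is a hypothetical stand-in, not the authors' code.

```python
# Sketch only: a GRPO-style RL loop with external reward models and a
# periodic, decoupled reward schedule. All functions are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def generate_videos(prompt, group_size=8):
    """Stand-in for sampling a group of videos from the (frozen-architecture) generator."""
    return [f"{prompt}-sample-{i}" for i in range(group_size)]

def geometry_reward(video):
    """Stand-in for a score from a pre-trained 3D foundation model (structural coherence)."""
    return rng.uniform(0.0, 1.0)

def vlm_reward(video):
    """Stand-in for a vision-language model's judgment of scene dynamics / prompt adherence."""
    return rng.uniform(0.0, 1.0)

def policy_update(videos, advantages):
    """Stand-in for the Flow-GRPO policy-gradient step on the video generator's weights."""
    pass

def train(prompts, steps=4, period=2):
    for step in range(steps):
        # Periodic decoupling (one plausible reading): alternate which external
        # signal drives the update, so rigid geometry and scene motion are not
        # traded off within the same phase.
        reward_fn = geometry_reward if (step // period) % 2 == 0 else vlm_reward
        for prompt in prompts:
            videos = generate_videos(prompt)
            rewards = np.array([reward_fn(v) for v in videos])
            # Group-relative advantages: normalize rewards within the prompt's group.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
            policy_update(videos, adv)

train(["a drone orbits a lighthouse at dusk"])
```

The point of the sketch is the shape of the loop: the generator is only ever touched through reward-driven policy updates, which is why the base architecture stays fixed.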

Core claim

World-R1 shows that alignment with 3D constraints can be achieved post-training through reinforcement learning that receives direct feedback from existing 3D foundation models and vision-language models. By optimizing via Flow-GRPO on a dedicated text-only world-simulation dataset and applying periodic decoupled training, the method raises 3D consistency scores while leaving the underlying video generation network untouched.
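
For orientation, the objective that GRPO-style methods optimize typically has the following generic form; the paper's Flow-GRPO instantiation over flow-matching sampling trajectories is not spelled out in the abstract, so treat this as a reference point rather than the authors' exact loss.

```latex
% Generic GRPO-style objective (sketch). Here r_i are the scores returned by the
% external 3D foundation models / VLMs for the i-th sample in a group of G.
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(\rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).
\]
```

Because the reward terms come entirely from external models, improving this objective does not require new layers or architectural hooks in the video generator itself.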

What carries the argument

Flow-GRPO reinforcement learning that converts feedback from pre-trained 3D foundation models and vision-language models into policy updates for video generation.

If this is right

  • Video generation pipelines can incorporate 3D awareness at inference or fine-tuning time rather than during initial training.
  • World-simulation tasks become feasible at larger scale because no new model architecture is required.
  • Dynamic scene elements remain fluid because the training schedule explicitly separates geometric and motion objectives.
  • Existing high-quality video models can be upgraded for consistency without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same feedback loop could be applied to enforce other physical priors such as lighting or material properties once suitable reward models exist.
  • If the quality of the 3D feedback models continues to improve, the ceiling on achievable consistency rises without any change to the video generator itself.
  • Periodic decoupling may generalize to other trade-offs in generative models where one property must be preserved while another is optimized.

Load-bearing premise

Signals from pre-trained 3D foundation models and vision-language models provide reliable, unbiased measures of structural coherence that improve video output when used as rewards.
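
One way to stress-test that premise, sketched below with synthetic numbers: score a batch of the generator's own videos with the reward models and with an independent geometric metric, then check the rank correlation. Nothing in the snippet comes from the paper; the arrays are placeholders for real scores.

```python
# Sketch: does the external reward track true geometric consistency, or is it
# only a proxy the policy could game? Synthetic data stands in for real scores.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical per-video scores from the pre-trained 3D / VLM reward models.
reward_scores = rng.uniform(0.0, 1.0, size=200)

# Hypothetical independent metric (e.g. multi-view reprojection error, sign-flipped
# so higher is better); correlated here only by construction, for illustration.
geometric_metric = 0.7 * reward_scores + 0.3 * rng.uniform(0.0, 1.0, size=200)

rho, pval = spearmanr(reward_scores, geometric_metric)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g})")
# A weak or unstable correlation would undermine the premise: optimization could
# then raise the reward (the proxy) without improving actual 3D structure.
```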

What would settle it

A side-by-side comparison on a held-out set of prompts in which videos produced after Flow-GRPO optimization show no gain, or an outright loss, in standard 3D consistency metrics such as multi-view geometric error or object trajectory stability.
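
One of those metrics can be made concrete cheaply. The sketch below scores object trajectory stability as second-difference jitter over hypothetical tracker outputs; the tracker, the scale, and the comparison protocol are assumptions, not the paper's evaluation setup.

```python
# Sketch: trajectory-stability metric for the settling experiment described above.
import numpy as np

def trajectory_jitter(centroids: np.ndarray) -> float:
    """centroids: (T, 2) pixel positions of one tracked object over T frames.
    Returns mean second-difference magnitude; lower = steadier motion."""
    accel = np.diff(centroids, n=2, axis=0)          # frame-to-frame acceleration
    return float(np.linalg.norm(accel, axis=1).mean())

# Synthetic example: a smooth arc vs. the same arc with per-frame jumps.
t = np.linspace(0.0, 1.0, 48)
smooth = np.stack([100 + 200 * t, 80 + 40 * np.sin(2 * np.pi * t)], axis=1)
jumpy = smooth + np.random.default_rng(2).normal(scale=6.0, size=smooth.shape)

print(f"smooth: {trajectory_jitter(smooth):.2f}  jumpy: {trajectory_jitter(jumpy):.2f}")
# The settling test would compare this (and multi-view geometric error) for videos
# sampled before vs. after Flow-GRPO optimization on held-out prompts.
```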

read the original abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes World-R1, a framework for text-to-video generation that reinforces 3D constraints using reinforcement learning. Specifically, it introduces Flow-GRPO to optimize a video foundation model with feedback from pre-trained 3D foundation models and vision-language models, supported by a new pure text dataset for world simulation and a periodic decoupled training strategy to maintain both geometric consistency and dynamic fluidity. The central claim is that this method significantly improves 3D consistency without altering the underlying model architecture or compromising visual quality.

Significance. Should the quantitative improvements be demonstrated rigorously, this paper would offer a valuable contribution to scalable world simulation by providing an efficient post-hoc alignment technique that avoids the computational overhead of architectural changes. The use of external model feedback for RL in video generation is a promising direction, and the specialized dataset could serve as a useful resource for the community if released.

major comments (3)
  1. Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.
  2. Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.
  3. Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.
minor comments (2)
  1. Abstract: The term 'pure text dataset' could be clarified regarding its construction, size, and how it differs from existing text corpora used in video training.
  2. Overall: Some notation around Flow-GRPO could benefit from a formal definition or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodological details.

read point-by-point responses
  1. Referee: Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.

    Authors: The abstract provides a high-level summary of our contributions, while the full quantitative evaluations—including specific metrics on 3D consistency, baseline comparisons, and ablation studies—are detailed in the Experiments section. To make the abstract more self-contained and address this concern, we will revise it to include representative numerical results and references to the supporting analyses in the main text. revision: yes

  2. Referee: Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.

    Authors: The reward signals draw from established pre-trained 3D foundation models and VLMs whose geometric capabilities have been validated in prior work. We agree that direct correlation analysis on videos generated by our model would further substantiate the approach. In the revised manuscript, we will add an analysis section validating these signals against geometric consistency measures such as multi-view reprojection error and point-cloud consistency. revision: yes

  3. Referee: Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.

    Authors: The periodic decoupled training strategy is outlined in the Training Strategy section to maintain both geometric consistency and dynamic fluidity. We will expand this section in the revision with explicit details on the decoupling mechanism, reward formulation, and additional experiments demonstrating that the strategy mitigates reward hacking while preserving scene dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity: external pre-trained models supply independent feedback signals

full rationale

The paper's core derivation uses reinforcement learning (Flow-GRPO) driven by feedback from separately pre-trained 3D foundation models and VLMs, followed by independent evaluations of 3D consistency. No equations or definitions reduce the claimed improvement to the input rewards by construction, no self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. The approach is self-contained because the reward models are external and the final claims rest on separate evaluation metrics rather than tautological re-use of the same signals.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, training objectives, or implementation specifics, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5483 in / 1121 out tokens · 40426 ms · 2026-05-08T04:23:12.305647+00:00 · methodology

