arxiv: 2604.24764 · v3 · pith:KKLSRPH4new · submitted 2026-04-27 · 💻 cs.CV

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Pith reviewed 2026-05-25 06:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · 3D consistency · reinforcement learning · world simulation · geometric constraints · Flow-GRPO · video foundation models · structural coherence

The pith

Reinforcement learning with feedback from pre-trained 3D models enforces geometric consistency in text-to-video generation without changing the base architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents World-R1 as a way to fix geometric inconsistencies in video foundation models by aligning their outputs to 3D constraints. It does this through reinforcement learning on a new pure-text dataset designed for world simulation, using signals from existing 3D foundation models and vision-language models via the Flow-GRPO optimizer. A periodic decoupled training strategy helps maintain both rigid structure and scene motion. If the approach works, text-to-video systems could move from visually appealing but structurally unreliable outputs toward reliable world simulation at scale while keeping their original image quality. A sympathetic reader cares because current video generators often produce scenes that violate basic 3D rules, limiting their use in planning or simulation tasks.

Core claim

World-R1 aligns video generation with 3D constraints through reinforcement learning. It optimizes the model with Flow-GRPO, using feedback from pre-trained 3D foundation models and vision-language models together with a specialized pure-text dataset for world simulation, and it applies a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. The claimed result is enhanced 3D consistency with the original visual quality of the foundation model preserved.

What carries the argument

Flow-GRPO optimization that incorporates feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence during RL fine-tuning, supported by a pure-text world-simulation dataset and periodic decoupled training.
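As a rough illustration of the group-relative step that GRPO-style optimizers rely on, the sketch below mixes two per-video scores into one reward and normalizes it within a group of samples drawn for the same prompt. The function names, the 0.5/0.5 weights, and the example scores are all hypothetical; the abstract does not state the paper's reward definition or weighting, so this is not the authors' formulation.

```python
from statistics import mean, pstdev

def combined_reward(geo_score, vlm_score, w_geo=0.5, w_vlm=0.5):
    """Hypothetical mix of a 3D-consistency score and a VLM score (weights assumed)."""
    return w_geo * geo_score + w_vlm * vlm_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up scores for four videos sampled from the same prompt.
geo = [0.9, 0.4, 0.6, 0.7]
vlm = [0.8, 0.5, 0.7, 0.6]
rewards = [combined_reward(g, v) for g, v in zip(geo, vlm)]
advantages = group_relative_advantages(rewards)
```

Videos that score above their group's mean receive a positive advantage and those below receive a negative one, so the update depends on relative rather than absolute reward.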

If this is right

  • Video foundation models can gain 3D consistency through post-training optimization rather than expensive architectural redesigns.
  • A pure-text dataset suffices to drive the alignment when paired with external 3D and language model feedback.
  • Periodic decoupled training allows the model to satisfy both geometric rigidity and scene dynamics at once.
  • The resulting videos remain visually comparable to the original foundation model while becoming more suitable for world simulation.
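The abstract does not say how the periodic decoupled strategy is scheduled. One possible reading, sketched here purely as an assumption, is that training alternates between a geometry-focused phase and a dynamics-focused phase on a fixed period. The phase names, the period length, and the function `phase_for_step` are hypothetical.

```python
def phase_for_step(step, period=100):
    """Return which (assumed) reward phase is active at a given training step.

    Even-numbered periods use the geometry phase, odd-numbered ones the dynamics phase.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return "geometry" if (step // period) % 2 == 0 else "dynamics"

# With a period of 2, the phase switches every two steps.
schedule = [phase_for_step(s, period=2) for s in range(8)]
```

A fixed alternation is only the simplest option; the actual method may interleave or weight the two objectives differently.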

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback-driven RL loop could be tested on image or 3D asset generators facing consistency problems.
  • Success would imply that external pre-trained models can serve as reliable reward sources for other generative tasks where direct 3D supervision is expensive.
  • If the method scales, it opens a path to iteratively refine any video model toward better physical plausibility using only text prompts and frozen evaluators.

Load-bearing premise

Feedback signals from pre-trained 3D foundation models and vision-language models can reliably enforce structural coherence during RL optimization without introducing new inconsistencies.

What would settle it

Running the same evaluation benchmarks on the base video model versus the World-R1 version and finding no statistically significant improvement in any 3D consistency metric while visual quality remains unchanged.
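One way to run such a check, assuming a per-prompt 3D consistency score is available for both models on the same prompts, is a paired sign-flip permutation test. The sketch below uses only the Python standard library; the function name and the scores in the usage example are invented for illustration and are not taken from the paper.

```python
import random

def paired_sign_flip_test(base, tuned, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on the mean per-prompt difference."""
    if len(base) != len(tuned) or not base:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [t - b for b, t in zip(base, tuned)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to be +d or -d.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Made-up per-prompt scores for a base model and a fine-tuned model.
base_scores = [0.50, 0.52, 0.48, 0.55, 0.51, 0.49, 0.53, 0.47, 0.54, 0.50, 0.52, 0.46]
tuned_scores = [b + 0.1 for b in base_scores]
p_value = paired_sign_flip_test(base_scores, tuned_scores, n_resamples=2000)
```

A large p-value on every 3D consistency metric would match the falsifying outcome described above; pairing by prompt matters because prompt difficulty varies far more than the expected effect.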

Original abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes World-R1, a framework for aligning text-to-video foundation models with 3D constraints via reinforcement learning. It introduces a pure-text dataset for world simulation and uses Flow-GRPO optimization driven by feedback from pre-trained 3D foundation models and vision-language models. A periodic decoupled training strategy is employed to balance geometric consistency and scene dynamics, with the central claim that this improves 3D consistency without architectural modifications while preserving visual quality.

Significance. If the quantitative results hold, the work offers a potentially scalable route to geometric consistency in video generation by leveraging external 3D/VLM feedback and RL rather than costly architectural changes. This could help bridge video synthesis and world simulation, provided the feedback signals prove reliable and do not introduce new artifacts.

major comments (2)
  1. [Abstract] The central claim that the method 'significantly enhances 3D consistency' while 'preserving the original visual quality' lacks any reported metrics, baselines, or error bars. Without these, the strength of the result cannot be evaluated.
  2. [Abstract] The description of Flow-GRPO and the reward formulation derived from 3D/VLM feedback is stated at a high level only; no equations, reward definitions, or training dynamics are supplied, preventing verification that the optimization enforces structural coherence without new inconsistencies.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive evaluations' but provides no table or figure references; adding a summary table of 3D consistency metrics (e.g., geometric error, temporal coherence) would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address the two major points on the abstract below, clarifying the relationship between the abstract and the full manuscript while proposing targeted revisions where they strengthen the presentation.

  1. Referee: [Abstract] Abstract: The central claim that the method 'significantly enhances 3D consistency' while 'preserving the original visual quality' lacks any reported metrics, baselines, or error bars. Without these, the strength of the result cannot be evaluated.

    Authors: We agree the abstract presents the claim qualitatively. The full manuscript contains quantitative results, including specific metrics for 3D consistency, baselines, and error bars, reported in the Experiments section with comparisons to prior methods. We will revise the abstract to include one or two key quantitative highlights (e.g., relative improvement percentages) to make the claim more self-contained while respecting length constraints.

    revision: yes

  2. Referee: [Abstract] Abstract: The description of Flow-GRPO and the reward formulation derived from 3D/VLM feedback is stated at a high level only; no equations, reward definitions, or training dynamics are supplied, preventing verification that the optimization enforces structural coherence without new inconsistencies.

    Authors: The abstract follows standard conventions by providing a high-level overview. Full equations for Flow-GRPO, the reward formulation combining 3D foundation model and VLM feedback, and the periodic decoupled training dynamics are detailed in Section 3 (Methods) of the manuscript, including pseudocode and training procedure. This allows verification of how structural coherence is enforced. We do not plan to add equations to the abstract due to space limits but can ensure the abstract explicitly points to the methods section if helpful.

    revision: no

Circularity Check

0 steps flagged

No significant circularity


The abstract and available description present World-R1 as an RL alignment method (Flow-GRPO) that consumes feedback signals from independent pre-trained 3D foundation models and VLMs plus a pure-text dataset. No derivation chain, equations, fitted parameters, or self-citations are shown that reduce a claimed prediction or uniqueness result to the inputs by construction. The central claim therefore remains externally grounded rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies insufficient technical detail to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5714 in / 1075 out tokens · 38697 ms · 2026-05-25T06:42:23.750493+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geo-Align: Video Generation Alignment via Metric Geometry Reward

    cs.CV 2026-05 unverdicted novelty 7.0

    Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

  2. LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.
