pith. machine review for the scientific record.

arxiv: 2604.24764 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · 3D consistency · reinforcement learning · world simulation · geometric constraints · Flow-GRPO · foundation models

The pith

Reinforcement learning with feedback from 3D models enforces geometric consistency in text-to-video generation without changing the base architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video foundation models create visually strong outputs but often violate basic 3D structure across frames. World-R1 addresses this by treating 3D coherence as a reinforcement learning objective rather than an architectural constraint. It supplies a pure-text dataset for world simulation and uses Flow-GRPO to draw reward signals from pre-trained 3D foundation models and vision-language models. A periodic decoupled training schedule keeps rigid geometry and scene motion in balance. The result is measurably higher structural consistency while the original visual quality of the foundation model remains unchanged.
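
A minimal sketch of that loop as Pith reads it, assuming a GRPO-style group-relative update and one plausible reading of the periodic decoupled schedule (alternating which external signal drives each phase). Every name below (generate_videos, geometry_reward, vlm_reward, policy_update) is a hypothetical stand-in, not the authors' code.

```python
# Sketch only: a GRPO-style RL loop with external reward models and a
# periodic, decoupled reward schedule. All functions are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def generate_videos(prompt, group_size=8):
    """Stand-in for sampling a group of videos from the (frozen-architecture) generator."""
    return [f"{prompt}-sample-{i}" for i in range(group_size)]

def geometry_reward(video):
    """Stand-in for a score from a pre-trained 3D foundation model (structural coherence)."""
    return rng.uniform(0.0, 1.0)

def vlm_reward(video):
    """Stand-in for a vision-language model's judgment of scene dynamics / prompt adherence."""
    return rng.uniform(0.0, 1.0)

def policy_update(videos, advantages):
    """Stand-in for the Flow-GRPO policy-gradient step on the video generator's weights."""
    pass

def train(prompts, steps=4, period=2):
    for step in range(steps):
        # Periodic decoupling (one plausible reading): alternate which external
        # signal drives the update, so rigid geometry and scene motion are not
        # traded off within the same phase.
        reward_fn = geometry_reward if (step // period) % 2 == 0 else vlm_reward
        for prompt in prompts:
            videos = generate_videos(prompt)
            rewards = np.array([reward_fn(v) for v in videos])
            # Group-relative advantages: normalize rewards within the prompt's group.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
            policy_update(videos, adv)

train(["a drone orbits a lighthouse at dusk"])
```

The point of the sketch is the shape of the loop: the generator is only ever touched through reward-driven policy updates, which is why the base architecture stays fixed.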

Core claim

World-R1 shows that alignment with 3D constraints can be achieved post-training through reinforcement learning that receives direct feedback from existing 3D foundation models and vision-language models. By optimizing via Flow-GRPO on a dedicated text-only world-simulation dataset and applying periodic decoupled training, the method raises 3D consistency scores while leaving the underlying video generation network untouched.
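
For orientation, the objective that GRPO-style methods optimize typically has the following generic form; the paper's Flow-GRPO instantiation over flow-matching sampling trajectories is not spelled out in the abstract, so treat this as a reference point rather than the authors' exact loss.

```latex
% Generic GRPO-style objective (sketch). Here r_i are the scores returned by the
% external 3D foundation models / VLMs for the i-th sample in a group of G.
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(\rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).
\]
```

Because the reward terms come entirely from external models, improving this objective does not require new layers or architectural hooks in the video generator itself.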

What carries the argument

Flow-GRPO reinforcement learning that converts feedback from pre-trained 3D foundation models and vision-language models into policy updates for video generation.

If this is right

  • Video generation pipelines can incorporate 3D awareness at inference or fine-tuning time rather than during initial training.
  • World-simulation tasks become feasible at larger scale because no new model architecture is required.
  • Dynamic scene elements remain fluid because the training schedule explicitly separates geometric and motion objectives.
  • Existing high-quality video models can be upgraded for consistency without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same feedback loop could be applied to enforce other physical priors such as lighting or material properties once suitable reward models exist.
  • If the quality of the 3D feedback models continues to improve, the ceiling on achievable consistency rises without any change to the video generator itself.
  • Periodic decoupling may generalize to other trade-offs in generative models where one property must be preserved while another is optimized.

Load-bearing premise

Signals from pre-trained 3D foundation models and vision-language models provide reliable, unbiased measures of structural coherence that improve video output when used as rewards.
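
One way to stress-test that premise, sketched below with synthetic numbers: score a batch of the generator's own videos with the reward models and with an independent geometric metric, then check the rank correlation. Nothing in the snippet comes from the paper; the arrays are placeholders for real scores.

```python
# Sketch: does the external reward track true geometric consistency, or is it
# only a proxy the policy could game? Synthetic data stands in for real scores.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical per-video scores from the pre-trained 3D / VLM reward models.
reward_scores = rng.uniform(0.0, 1.0, size=200)

# Hypothetical independent metric (e.g. multi-view reprojection error, sign-flipped
# so higher is better); correlated here only by construction, for illustration.
geometric_metric = 0.7 * reward_scores + 0.3 * rng.uniform(0.0, 1.0, size=200)

rho, pval = spearmanr(reward_scores, geometric_metric)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g})")
# A weak or unstable correlation would undermine the premise: optimization could
# then raise the reward (the proxy) without improving actual 3D structure.
```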

What would settle it

A side-by-side comparison on a held-out set of prompts in which videos produced after Flow-GRPO optimization show no gain, or an outright loss, in standard 3D consistency metrics such as multi-view geometric error or object trajectory stability.
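
One of those metrics can be made concrete cheaply. The sketch below scores object trajectory stability as second-difference jitter over hypothetical tracker outputs; the tracker, the scale, and the comparison protocol are assumptions, not the paper's evaluation setup.

```python
# Sketch: trajectory-stability metric for the settling experiment described above.
import numpy as np

def trajectory_jitter(centroids: np.ndarray) -> float:
    """centroids: (T, 2) pixel positions of one tracked object over T frames.
    Returns mean second-difference magnitude; lower = steadier motion."""
    accel = np.diff(centroids, n=2, axis=0)          # frame-to-frame acceleration
    return float(np.linalg.norm(accel, axis=1).mean())

# Synthetic example: a smooth arc vs. the same arc with per-frame jumps.
t = np.linspace(0.0, 1.0, 48)
smooth = np.stack([100 + 200 * t, 80 + 40 * np.sin(2 * np.pi * t)], axis=1)
jumpy = smooth + np.random.default_rng(2).normal(scale=6.0, size=smooth.shape)

print(f"smooth: {trajectory_jitter(smooth):.2f}  jumpy: {trajectory_jitter(jumpy):.2f}")
# The settling test would compare this (and multi-view geometric error) for videos
# sampled before vs. after Flow-GRPO optimization on held-out prompts.
```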

read the original abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes World-R1, a framework for text-to-video generation that reinforces 3D constraints using reinforcement learning. Specifically, it introduces Flow-GRPO to optimize a video foundation model with feedback from pre-trained 3D foundation models and vision-language models, supported by a new pure text dataset for world simulation and a periodic decoupled training strategy to maintain both geometric consistency and dynamic fluidity. The central claim is that this method significantly improves 3D consistency without altering the underlying model architecture or compromising visual quality.

Significance. Should the quantitative improvements be demonstrated rigorously, this paper would offer a valuable contribution to scalable world simulation by providing an efficient post-hoc alignment technique that avoids the computational overhead of architectural changes. The use of external model feedback for RL in video generation is a promising direction, and the specialized dataset could serve as a useful resource for the community if released.

major comments (3)
  1. Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.
  2. Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.
  3. Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.
minor comments (2)
  1. Abstract: The term 'pure text dataset' could be clarified regarding its construction, size, and how it differs from existing text corpora used in video training.
  2. Overall: Some notation around Flow-GRPO could benefit from a formal definition or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodological details.

read point-by-point responses
  1. Referee: Abstract: The statement that 'extensive evaluations reveal that our approach significantly enhances 3D consistency' lacks any accompanying metrics, baseline comparisons, or ablation studies. This is a load-bearing issue for the empirical claim, as without these, the significance of the improvement cannot be evaluated.

    Authors: The abstract provides a high-level summary of our contributions, while the full quantitative evaluations—including specific metrics on 3D consistency, baseline comparisons, and ablation studies—are detailed in the Experiments section. To make the abstract more self-contained and address this concern, we will revise it to include representative numerical results and references to the supporting analyses in the main text. revision: yes

  2. Referee: Method section: The Flow-GRPO optimization relies on reward signals from pre-trained 3D and VLM models, but there is no analysis or validation showing that these signals correlate with actual geometric consistency (e.g., via multi-view reprojection error or point-cloud consistency) on the model's own generated videos. This creates a risk that the optimization improves the proxy rather than the true 3D structure.

    Authors: The reward signals draw from established pre-trained 3D foundation models and VLMs whose geometric capabilities have been validated in prior work. We agree that direct correlation analysis on videos generated by our model would further substantiate the approach. In the revised manuscript, we will add an analysis section validating these signals against geometric consistency measures such as multi-view reprojection error and point-cloud consistency. revision: yes

  3. Referee: Training strategy section: The periodic decoupled training is intended to balance rigid consistency with fluidity, but without details on the decoupling mechanism, reward formulation, or experiments showing it prevents reward hacking while preserving dynamics, it is unclear if the approach achieves the claimed balance.

    Authors: The periodic decoupled training strategy is outlined in the Training Strategy section to maintain both geometric consistency and dynamic fluidity. We will expand this section in the revision with explicit details on the decoupling mechanism, reward formulation, and additional experiments demonstrating that the strategy mitigates reward hacking while preserving scene dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity: external pre-trained models supply independent feedback signals

full rationale

The paper's core derivation uses reinforcement learning (Flow-GRPO) driven by feedback from separately pre-trained 3D foundation models and VLMs, followed by independent evaluations of 3D consistency. No equations or definitions reduce the claimed improvement to the input rewards by construction, no self-citations are load-bearing for the central claim, and no fitted parameters are relabeled as predictions. The approach is self-contained because the reward models are external and the final claims rest on separate evaluation metrics rather than tautological re-use of the same signals.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, training objectives, or implementation specifics, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5483 in / 1121 out tokens · 40426 ms · 2026-05-08T04:23:12.305647+00:00 · methodology

