arxiv: 2507.07982 · v2 · submitted 2025-07-10 · 💻 cs.CV · cs.AI

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Pith reviewed 2026-05-19 05:10 UTC · model grok-4.3

keywords: video diffusion, 3D consistency, geometry alignment, world modeling, video generation, feature alignment

The pith

Aligning intermediate features of video diffusion models with geometric representations improves 3D consistency in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Videos are 2D projections of a 3D world, yet video diffusion models trained only on raw video data often fail to capture meaningful geometric structure in their internal representations. Geometry Forcing addresses this by adding two alignment objectives during training: angular alignment that matches directional information via cosine similarity, and scale alignment that preserves magnitude by regressing geometric features from the diffusion model's normalized representations. The method is tested on both camera-view conditioned and action-conditioned video generation. If the approach holds, the resulting models would produce videos with stronger visual quality and better 3D consistency across frames and viewpoints, moving closer to reliable world modeling from 2D data alone.

Core claim

The central claim is that guiding a video diffusion model's intermediate representations to align with features from a geometric foundation model causes the diffusion model to internalize 3D-aware structure. This alignment is implemented through Angular Alignment, which enforces directional consistency with cosine similarity, and Scale Alignment, which regresses geometric features from normalized diffusion representations. When applied to standard video generation tasks, the resulting models show substantially higher visual quality and 3D consistency than baselines trained without these objectives.
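The two objectives described above can be sketched in a few lines. This is a minimal illustration in dependency-free Python, not the paper's implementation: it assumes the diffusion features have already been projected to the same dimension as the geometric features, and the function names and the linear head (`weight`, `bias`) used for Scale Alignment are hypothetical.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def angular_alignment_loss(diff_feats, geo_feats):
    """Mean (1 - cosine similarity) over matching token pairs.

    Only direction matters here: rescaling either feature leaves the loss unchanged.
    """
    total = 0.0
    for h, g in zip(diff_feats, geo_feats):
        h_hat, g_hat = l2_normalize(h), l2_normalize(g)
        total += 1.0 - sum(a * b for a, b in zip(h_hat, g_hat))
    return total / len(diff_feats)

def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """Mean squared error between the geometric features and a linear
    prediction made from the *normalized* diffusion features.

    `weight` (rows = output dims) and `bias` form a hypothetical linear head.
    """
    total, count = 0.0, 0
    for h, g in zip(diff_feats, geo_feats):
        h_hat = l2_normalize(h)
        pred = [sum(w * x for w, x in zip(row, h_hat)) + b
                for row, b in zip(weight, bias)]
        total += sum((p - t) ** 2 for p, t in zip(pred, g))
        count += len(g)
    return total / count
```

In this sketch the angular term is blind to feature magnitude, which is why a separate regression term is needed to carry scale information. How the two terms are weighted against the diffusion loss is not specified on this page.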

What carries the argument

Geometry Forcing, the training-time alignment of diffusion model intermediate representations with geometric foundation model features using angular and scale objectives.

If this is right

  • Video generation becomes more consistent across camera viewpoints and across successive frames.
  • The improvements apply to both camera-conditioned and action-conditioned generation settings.
  • Geometric awareness is added without altering the diffusion model's architecture or inference procedure.
  • Training on raw video alone is not sufficient for high-quality 3D-aware outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment principle might transfer to other generative models that operate on 2D projections of 3D scenes.
  • Stronger 3D consistency could support downstream uses such as simulation or planning that require reliable spatial understanding.
  • Longer video sequences might benefit most, as accumulated geometric errors would be reduced.

Load-bearing premise

That aligning diffusion features with geometric features will make the model internalize genuine 3D structure rather than superficial correlations.

What would settle it

Training identical video diffusion models with and without the two alignment objectives and finding no measurable difference in 3D consistency metrics such as multi-view reprojection error or temporal geometry stability.
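As a rough illustration of one such metric, the sketch below computes a mean reprojection error under a simple pinhole camera model. It is an assumption-laden example rather than the paper's evaluation protocol: the function names are hypothetical, and the 3D points are assumed to have already been reconstructed from the generated frames and expressed in the target camera's coordinate frame.

```python
import math

def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

def mean_reprojection_error(points_cam, observed_px, focal, cx, cy):
    """Average pixel distance between projected 3D points and their observed locations."""
    errors = []
    for p, (u_obs, v_obs) in zip(points_cam, observed_px):
        u, v = project(p, focal, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)
```

Running this on videos from a model trained with the alignment objectives and from an otherwise identical model trained without them would give the kind of paired comparison the test calls for; comparable error in both cases would count against the claim.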

Figures

Figures reproduced from arXiv: 2507.07982 by Diankun Wu, Haoyu Wu, Jiang Bian, Junliang Guo, Tianyu He, Yang Ye, Yueqi Duan.

Figure 1
Figure 1: Geometry Forcing equips video diffusion models with 3D awareness. (a) We propose Geometry Forcing (GF), a simple yet effective paradigm to internalize geometric-aware structure into video diffusion models by aligning with features from a pretrained geometric foundation model, i.e., VGGT (Wang et al., 2025). (b) Compared to the baseline method (Song et al., 2025), our method produces more consistent generat…
Figure 2
Figure 2: Qualitative comparison of camera view-conditioned video generation under full-circle rotation. Videos are generated from a single input frame and corresponding per-frame camera poses simulating a full 360° rotation. Our method (GF) is compared with DFoT (Song et al., 2025), VideoREPA (Zhang et al., 2025c), and REPA (Zhang et al., 2025c). The results demonstrate that the baseline methods fail to maintain te…
Figure 3
Figure 3: Ablation study on alignment depth. We present FVD-256 and FVD-16 results for aligning VGGT to different layers of the diffusion model. The results suggest that mid-level feature alignment is most effective for improving long-term video quality.
Figure 5
Figure 5: Qualitative comparisons on camera-conditioned video generation. All the videos are generated given the first frame and per-frame camera poses. We comprehensively compare GF (ours) with DFoT (Song et al., 2025), VideoREPA (Zhang et al., 2025c), and REPA (Zhang et al., 2025c). The results demonstrate consistency in long-term video generation for both indoor (left) and outdoor (right) scenes.
Original abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Geometry Forcing to address the limitation that video diffusion models trained on raw video data fail to capture meaningful geometric structure. It aligns intermediate diffusion representations with features from a geometric foundation model via two objectives: Angular Alignment (cosine similarity on normalized features for directional consistency) and Scale Alignment (regression to preserve scale information). The approach is evaluated on camera-view conditioned and action-conditioned video generation tasks, claiming substantial improvements in visual quality and 3D consistency over baselines.

Significance. If the alignment objectives can be shown to reshape the learned denoising trajectory such that generated videos respect 3D structure under novel camera motions and actions, the method would offer a lightweight way to inject geometric awareness into video diffusion models. This could advance consistent world modeling for downstream applications in simulation and robotics, provided the gains are not reducible to superficial feature correlations.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods' is unsupported by any reported quantitative metrics, baseline descriptions, dataset details, or statistical significance tests, making it impossible to assess whether the results validate the claim.
  2. [Method] Method section (alignment objectives): Angular Alignment (cosine similarity) and Scale Alignment (regression) are applied to intermediate representations, but no analysis or ablation demonstrates that these losses propagate into the generative sampling process to enforce 3D-consistent dynamics under novel conditions rather than producing superficial correlations with the external geometric model outputs.
minor comments (1)
  1. [Method] The paper would benefit from explicit layer indices or feature extraction details when aligning with the geometric foundation model to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods' is unsupported by any reported quantitative metrics, baseline descriptions, dataset details, or statistical significance tests, making it impossible to assess whether the results validate the claim.

    Authors: We agree that the abstract would benefit from more concrete support for its claims. The full manuscript presents quantitative metrics, baseline comparisons (including standard video diffusion models without geometry alignment), dataset specifications for the camera-view and action-conditioned tasks, and evaluation protocols in the experiments section. To directly address this point, we have revised the abstract to reference key quantitative improvements and the evaluation setup while maintaining its concise nature. revision: yes

  2. Referee: [Method] Method section (alignment objectives): Angular Alignment (cosine similarity) and Scale Alignment (regression) are applied to intermediate representations, but no analysis or ablation demonstrates that these losses propagate into the generative sampling process to enforce 3D-consistent dynamics under novel conditions rather than producing superficial correlations with the external geometric model outputs.

    Authors: This comment correctly identifies a gap in mechanistic analysis. Our current results demonstrate improved 3D consistency on novel camera motions and actions, which provides indirect evidence that the alignment influences generation beyond superficial correlations. However, we acknowledge that explicit analysis of the denoising trajectory would strengthen the paper. We have added ablations and visualizations in the revised manuscript examining how the objectives affect intermediate sampling steps and feature propagation during generation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in proposed alignment method

full rationale

The paper introduces Geometry Forcing as a training technique that aligns intermediate representations of a video diffusion model with features extracted from an external geometric foundation model, using two explicit objectives (Angular Alignment via cosine similarity and Scale Alignment via regression). These objectives are defined independently of the target generative outputs and are evaluated through separate experiments on camera-conditioned and action-conditioned video generation. No derivation step reduces a claimed prediction or 3D consistency result to a quantity defined in terms of the alignment losses themselves, nor does any central claim rely on a self-citation chain or uniqueness theorem imported from prior author work. The method remains self-contained against the external geometric model and experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that geometric foundation model features encode transferable 3D structure that can be injected into diffusion representations via alignment losses. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Aligning diffusion model intermediate representations with geometric foundation model features will produce geometry-aware structure inside the diffusion model.
    This is the key insight stated in the abstract that motivates the two alignment objectives.

pith-pipeline@v0.9.0 · 5723 in / 1262 out tokens · 47928 ms · 2026-05-19T05:10:06.181067+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geo-Align: Video Generation Alignment via Metric Geometry Reward

    cs.CV 2026-05 unverdicted novelty 7.0

    Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

  2. Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

    cs.CV 2026-05 unverdicted novelty 7.0

    Trust3R introduces a gated residual refinement plus Normal-Inverse-Wishart evidential head that produces closed-form multivariate Student-t uncertainty for per-point geometry in feed-forward 3D reconstruction and impr...

  3. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  4. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  5. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  6. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  7. TORA: Topological Representation Alignment for 3D Shape Assembly

    cs.CV 2026-04 unverdicted novelty 7.0

    TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...

  8. GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

  9. Improved Baselines with Representation Autoencoders

    cs.CV 2026-05 conditional novelty 6.0

    RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

  10. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  11. CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

  12. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  13. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  14. GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

    cs.CV 2026-05 unverdicted novelty 5.0

    GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.

  15. World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    World-R1 applies RL via Flow-GRPO on a new text dataset for world simulation to enforce 3D constraints in video generation while preserving visual quality.

  16. World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    World-R1 uses Flow-GRPO reinforcement learning and a new text dataset to enforce 3D consistency in text-to-video generation while keeping the original model's visual quality.

  17. World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 4.0

    World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 15 Pith papers · 21 internal anchors

  2. [2]

    Aether: Geometric-aware unified world modeling

    Aether, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, et al. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945, 2025

  3. [3]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  4. [4]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  5. [5]

    Tc4d: Trajectory-conditioned text-to-4d generation

    Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. In European Conference on Computer Vision, pp. 53–72. Springer, 2024a

  6. [6]

    4d-fy: Text-to-4d generation using hybrid score distillation sampling

    Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7996–8006, 2024b

  7. [7]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022

  8. [8]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22669–22679, 2023

  9. [9]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15791–15801, 2025

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020

  11. [11]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  12. [12]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024a

  13. [13]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

  14. [14]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024b

  15. [15]

    Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis

    Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis. arXiv preprint arXiv:2503.13265, 2025

  16. [16]

    Playing with transformer at 30+ fps via next-frame diffusion

    Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, and Jiang Bian. Playing with transformer at 30+ fps via next-frame diffusion. arXiv preprint arXiv:2506.01380, 2025

  17. [17]

    Luciddreamer: Domain-free generation of 3d gaussian splatting scenes

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

  18. [18]

    Oasis: A universe in a transformer

    Decart, Quevedo Julian, McIntyre Quinn, Campbell Spruce, Chen Xinlei, and Wachen Robert. Oasis: A universe in a transformer. 2024. URL https://oasis-model.github.io/

  19. [19]

    Worldscore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

  20. [20]

    Institutionum calculi integralis, volume 4

    Leonhard Euler. Institutionum calculi integralis, volume 4. impensis Academiae imperialis scientiarum, 1845

  21. [21]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025

  22. [22]

    Driv3R: Learning dense 4d reconstruction for autonomous driving

    Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Jiwen Lu. Driv3r: Learning dense 4d reconstruction for autonomous driving. ArXiv, abs/2412.06777, 2024. URL https://api.semanticscholar.org/CorpusID:274610426

  23. [23]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

  24. [24]

    Maskflow: Discrete flows for flexible and efficient long video generation

    Michael Fuest, Vincent Tao Hu, and Björn Ommer. Maskflow: Discrete flows for flexible and efficient long video generation. arXiv preprint arXiv:2502.11234, 2025

  25. [25]

    An introduction to ray tracing

    Andrew S Glassner. An introduction to ray tracing. Morgan Kaufmann, 1989

  26. [26]

    Google. Veo 3. https://deepmind.google/models/veo/, 2025

  27. [27]

    Mineworld: a real-time and open-source interactive world model on minecraft

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

  28. [28]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  29. [29]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  30. [30]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020

  31. [31]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  32. [32]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020

  33. [33]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  34. [34]

    3drs: Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv preprint arXiv:2506.01946, 2025a

  35. [35]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025b

  36. [36]

    Geo4d: Leveraging video generators for geometric 4d scene reconstruction

    Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. arXiv preprint arXiv:2504.07961, 2025

  37. [37]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023

  38. [38]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning, pp. 25105–25124. PMLR, 2024

  39. [39]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  40. [40]

    Vividdream: Generating 3d scene with ambient dynamics

    Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y Feng, and Jia-Bin Huang. Vividdream: Generating 3d scene with ambient dynamics. arXiv preprint arXiv:2405.20334, 2024

  41. [41]

    T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

    Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. arXiv preprint arXiv:2405.18750, 2024

  42. [42]

    Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  43. [43]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

  44. [44]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8009--8019, 2025a

  45. [45]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023

  46. [46]

    Slam3r: Real-time dense scene reconstruction from monocular rgb videos

    Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 16651--16662, 2025b

  47. [47]

    VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025. URL https://api.semanticscholar.org/CorpusID:278739766

  48. [48]

    Can video diffusion model reconstruct 4d geometry?

    Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, and Bernard Ghanem. Can video diffusion model reconstruct 4d geometry? arXiv preprint arXiv:2503.21082, 2025

  49. [49]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1): 99--106, 2021

  50. [50]

    Giraffe: Representing scenes as compositional generative neural feature fields

    Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11453--11464, 2021

  51. [51]

    OpenAI. Sora. https://openai.com/index/sora/, 2024

  52. [52]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  53. [53]

    Genie 2: A large-scale foundation world model

    J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024

  54. [54]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195--4205, 2023

  55. [55]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10106--10116, 2024

  56. [56]

    Long-context state-space video world models

    Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. arXiv preprint arXiv:2505.20171, 2025

  57. [57]

    Movie Gen: A Cast of Media Foundation Models

    A Polyak, A Zohar, A Brown, A Tjandra, A Sinha, A Lee, A Vyas, B Shi, CY Ma, CY Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  58. [58]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  59. [59]

    Dreambooth3d: Subject-driven text-to-3d generation

    Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2349--2359, 2023

  60. [60]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12179--12188, 2021

  61. [61]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684--10695, 2022

  62. [62]

    Wham: Reconstructing world-grounded humans with accurate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2070--2080, 2024

  63. [63]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

  64. [64]

    History-Guided Video Diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025

  65. [65]

    Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

    Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7364--7373, 2025

  66. [66]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34: 16558--16569, 2021

  67. [67]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  68. [68]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  69. [69]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  70. [70]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  71. [71]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  72. [72]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14549--14560, 2023

  73. [73]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  74. [74]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697--20709, 2024

  75. [75]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4): 600--612, 2004

  76. [76]

    A learning algorithm for continually running fully recurrent neural networks

    Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2): 270--280, 1989

  77. [77]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025a

  78. [78]

    4d-fly: Fast 4d reconstruction from a single monocular video

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xiaohang Zhan, and Yueqi Duan. 4d-fly: Fast 4d reconstruction from a single monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 16663--16673, June 2025b

  79. [79]

    Video world models with long-term spatial memory

    Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025c

  80. [80]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 21469--21480, 2025

Showing first 80 references.