pith. machine review for the scientific record.

arXiv: 2603.11911 · v3 · submitted 2026-03-12 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

Pith reviewed 2026-05-15 12:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords generative frame model · real-time spatial inference · multi-view consistency · 3D anchors · spatial memory · image diffusion model · world simulation · consumer GPU

The pith

InSpatio-WorldFM generates each frame independently using explicit 3D anchors and implicit spatial memory to deliver real-time multi-view consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a frame-based generative model that produces individual frames without sequential video processing to achieve low-latency spatial inference. It transforms a pretrained image diffusion model through a three-stage pipeline into a controllable real-time generator while enforcing consistency across viewpoints. This design aims to preserve global scene geometry and fine details on consumer hardware, offering an alternative to latency-heavy video world models for interactive exploration.

Core claim

InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory to preserve global scene geometry and fine-grained visual details across viewpoint changes, after a progressive three-stage training process that starts from a pretrained image diffusion model and ends with few-step distillation for real-time operation.

What carries the argument

The combination of explicit 3D anchors and implicit spatial memory that enforces multi-view consistency during independent per-frame generation.
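
To make the "explicit 3D anchor" idea concrete, the sketch below renders a coloured point cloud into a target view with a pinhole camera and a z-buffer. It is a generic illustration, not the paper's renderer: the function name `render_point_cloud`, the intrinsics `K`, the `world_to_cam` pose convention, and the nearest-point splatting rule are all assumptions made for this example.

```python
import numpy as np


def render_point_cloud(points_world, colors, K, world_to_cam, height, width):
    """Splat coloured 3D points into a target view (hypothetical helper).

    points_world: (N, 3) points in world coordinates.
    colors:       (N, 3) RGB values in [0, 1].
    K:            (3, 3) pinhole intrinsics (assumed camera model).
    world_to_cam: (4, 4) rigid transform for the target pose.
    Returns an (H, W, 3) image and an (H, W) boolean coverage mask.
    """
    # Move the points into the target camera frame.
    ones = np.ones((len(points_world), 1))
    homog = np.concatenate([points_world, ones], axis=1)
    cam = (world_to_cam @ homog.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], cam[inside, 2], colors[inside]

    # Z-buffer: record the smallest depth per pixel, then draw only those points.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    nearest = z == depth[v, u]

    image = np.zeros((height, width, 3))
    mask = np.zeros((height, width), dtype=bool)
    image[v[nearest], u[nearest]] = colors[nearest]
    mask[v, u] = True
    return image, mask
```

Pixels left uncovered by the mask are the regions where a rendering like this carries no geometric evidence, which is where a generative model would have to rely on the reference appearance instead.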

Load-bearing premise

That explicit 3D anchors combined with implicit spatial memory are sufficient to preserve global scene geometry and fine-grained details across arbitrary viewpoint changes without additional mechanisms.

What would settle it

Demonstration of lost geometric consistency or missing details when the model generates frames from rapidly changing or previously unseen viewpoints on the same scene.
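
One way such a test could be run is a photometric reprojection check: warp the pixels of one generated view into another using depth and relative pose, then compare colours. The sketch below is a minimal version under stated assumptions. It presumes a per-pixel depth map and a shared pinhole intrinsics matrix are available, it ignores occlusion, and the name `reprojection_error` is hypothetical rather than a metric taken from the paper.

```python
import numpy as np


def reprojection_error(img_a, depth_a, img_b, K, a_to_b):
    """Mean absolute colour difference after warping view A into view B.

    img_a, img_b: (H, W, 3) images of the same scene.
    depth_a:      (H, W) depth of view A along the camera z-axis.
    K:            (3, 3) shared pinhole intrinsics (assumption).
    a_to_b:       (4, 4) rigid transform from camera A to camera B.
    Occlusions are ignored, so this is only a rough consistency signal.
    """
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Back-project every pixel of A to a 3D point in camera A.
    rays = (np.linalg.inv(K) @ pix.T).T
    pts_a = rays * depth_a.reshape(-1, 1)

    # Express the points in camera B and project them.
    pts_b = (a_to_b[:3, :3] @ pts_a.T).T + a_to_b[:3, 3]
    valid = pts_b[:, 2] > 1e-6
    z = np.where(valid, pts_b[:, 2], 1.0)  # avoid dividing by non-positive depth
    proj = (K @ pts_b.T).T
    ub = np.round(proj[:, 0] / z).astype(int)
    vb = np.round(proj[:, 1] / z).astype(int)
    valid &= (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    if not valid.any():
        return float("nan")

    # Compare colours at corresponding pixels.
    diff = np.abs(img_a.reshape(-1, 3)[valid] - img_b[vb[valid], ub[valid]])
    return float(diff.mean())
```

A value that grows as the viewpoint gap widens, or that spikes on previously unseen viewpoints, would be the kind of evidence described above.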

Figures

Figures reproduced from arXiv: 2603.11911 by Guofeng Zhang, Haomin Liu, Haoyu Ji, InSpatio Team: Donghui Shen, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Yifu Wang, Yipeng Chen, Zhewen Le, Zhichao Ye, Ziqiang Zhao.

Figure 1: Examples of generated worlds across diverse styles, including photorealistic, science-fiction, …
Figure 2: Overview. In the offline stage, a multi-view-consistent model generates plausible observations that provide 3D anchors and reference appearances. In the online stage, frame model performs fast real-time inference while updating scene content at keyframes.
Figure 3: The pipeline of InSpatio-WorldFM. The left part illustrates the conditional novel-view synthesis pipeline of WorldFM. WorldFM takes a reference image x_ref (implicit scene memory), noisy latents z_t, and point cloud rendering x̂_tgt (explicit 3D anchor) as inputs, which are spatially concatenated along the width dimension. Reference pose π_ref and target pose π_tgt are also injected as control signals. The f…
Figure 4: Qualitative results of teacher model.
Figure 5: Qualitative results of InSpatio-WorldFM.
Figure 6: Qualitative results of InSpatio-WorldFM.
Figure 7: Qualitative results of InSpatio-WorldFM.
Figure 8: Qualitative results of InSpatio-WorldFM. Across observations at varying viewing distances, …
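
The input layout described in the Figure 3 caption can be sketched as a single concatenation along the width axis. This is only an illustration: the channel-first `(C, H, W)` layout, the requirement that all three maps share one shape, the left-to-right ordering, and the name `build_model_input` are assumptions, since the caption does not specify them.

```python
import numpy as np


def build_model_input(x_ref, z_t, x_hat_tgt):
    """Place the three maps side by side along the width axis (hypothetical helper).

    x_ref:     reference image features (implicit scene memory).
    z_t:       noisy latents at the current diffusion step.
    x_hat_tgt: point cloud rendering for the target pose (explicit 3D anchor).
    All inputs are assumed to share one (C, H, W) shape.
    """
    if not (x_ref.shape == z_t.shape == x_hat_tgt.shape):
        raise ValueError("all inputs must share the same (C, H, W) shape")
    # Width is the last axis in a (C, H, W) layout.
    return np.concatenate([x_ref, z_t, x_hat_tgt], axis=-1)
```

With this layout, a transformer operating on the widened map can attend across the reference, the noisy target, and the rendered anchor in one pass; the pose signals mentioned in the caption would be injected separately and are not modelled here.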
Original abstract

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
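
The latency argument in the abstract can be illustrated with simple arithmetic: two generators can have identical throughput while differing greatly in how soon a new camera pose is reflected on screen. The sketch below uses made-up timings chosen only to make that point; they are not measurements from the paper, and the `GeneratorProfile` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorProfile:
    """Timing profile of one model call (illustrative values, not measured)."""

    frames_per_call: int
    seconds_per_call: float

    @property
    def throughput_fps(self) -> float:
        # Average number of frames produced per second.
        return self.frames_per_call / self.seconds_per_call

    @property
    def response_latency_s(self) -> float:
        # A pose submitted at the start of a call is first visible when that call returns.
        return self.seconds_per_call


# Hypothetical numbers chosen so both models have the same throughput.
window_model = GeneratorProfile(frames_per_call=16, seconds_per_call=0.5)
frame_model = GeneratorProfile(frames_per_call=1, seconds_per_call=0.03125)
```

Under these assumed numbers both profiles produce 32 frames per second, yet the window-level model responds to a new pose 16 times later than the per-frame model.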

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents InSpatio-WorldFM, an open-source real-time generative frame model for spatial intelligence. Unlike sequential video-based world models, it generates each frame independently, using explicit 3D anchors combined with implicit spatial memory to enforce multi-view consistency and preserve global scene geometry across viewpoint changes. A progressive three-stage training pipeline is described that transforms a pretrained image diffusion model into a controllable frame model and then, through few-step distillation, into a real-time generator. The central claim is that this yields strong multi-view consistency while enabling interactive exploration on consumer-grade GPUs, as an efficient alternative to window-based video models.

Significance. If the quantitative claims hold, the work would provide a practical advance in real-time spatial simulation by addressing latency bottlenecks in video-based approaches through independent frame generation. The open-source release and targeting of consumer hardware are strengths that could facilitate reproducibility and adoption in interactive applications. The progressive training pipeline, which ends in few-step distillation, offers a coherent mechanism for efficiency gains that is directly testable.

major comments (2)
  1. [Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.
  2. [Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.
minor comments (1)
  1. [Abstract] Abstract: Consider including a direct link to the open-source repository to support the reproducibility claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where the comments identify opportunities to strengthen the presentation.

  1. Referee: [Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.

    Authors: We agree that the abstract would be strengthened by explicitly referencing supporting quantitative details. The full manuscript reports these metrics, baselines, error analysis, and dataset information in the Experiments section. In the revised version we have updated the abstract to concisely incorporate the key multi-view consistency scores, primary baselines, and dataset references while remaining within length limits.
    revision: yes

  2. Referee: [Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.

    Authors: The manuscript already contains quantitative evaluations in the Experiments section that measure geometry preservation across large viewpoint changes when both components are present. To directly isolate their joint contribution, the revised manuscript adds a targeted ablation study comparing the full model against variants lacking 3D anchors or spatial memory, with results confirming improved consistency under extreme viewpoint shifts.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description outline a frame-based generation paradigm in which consistency is enforced via explicit 3D anchors plus implicit spatial memory, implemented through a progressive three-stage training pipeline (pretrained image diffusion model, then controllable frame model, then few-step distillation). No equations, derivations, or fitted parameters are shown that reduce any claimed prediction or consistency result to the inputs by construction. No self-citations, uniqueness theorems, or smuggled ansätze appear in the text. The central claims rest on an externally described training process rather than self-referential definitions, so the argument is not circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of diffusion model adaptability and the sufficiency of 3D anchors for consistency; no new entities are introduced.

free parameters (1)
  • distillation steps and training hyperparameters
    Standard in the three-stage pipeline but not enumerated in the abstract.
axioms (1)
  • domain assumption: Pretrained image diffusion models can be transformed into controllable frame generators via staged training
    Invoked in the progressive three-stage pipeline description.

pith-pipeline@v0.9.0 · 5521 in / 1135 out tokens · 47324 ms · 2026-05-15T12:02:56.565901+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    cs.RO 2026-04 unverdicted novelty 7.0

    3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

  2. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  3. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
