arXiv: 2603.11911 · v3 · submitted 2026-03-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team: Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xiaojun Xiang, Xiaoyu Zhang, Xianbin Liu, Yifu Wang, Yipeng Chen, Zhewen Le, Zhichao Ye, Ziqiang Zhao

Pith reviewed 2026-05-15 12:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: generative frame model · real-time spatial inference · multi-view consistency · 3D anchors · spatial memory · image diffusion model · world simulation · consumer GPU

The pith

InSpatio-WorldFM generates each frame independently using explicit 3D anchors and implicit spatial memory to deliver real-time multi-view consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a frame-based generative model that produces individual frames without sequential video processing to achieve low-latency spatial inference. It transforms a pretrained image diffusion model through a three-stage pipeline into a controllable real-time generator while enforcing consistency across viewpoints. This design aims to preserve global scene geometry and fine details on consumer hardware, offering an alternative to latency-heavy video world models for interactive exploration.

Core claim

InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory to preserve global scene geometry and fine-grained visual details across viewpoint changes, after a progressive three-stage training process that starts from a pretrained image diffusion model and ends with few-step distillation for real-time operation.

What carries the argument

The combination of explicit 3D anchors and implicit spatial memory that enforces multi-view consistency during independent per-frame generation.

Load-bearing premise

That explicit 3D anchors combined with implicit spatial memory are sufficient to preserve global scene geometry and fine-grained details across arbitrary viewpoint changes without additional mechanisms.

What would settle it

Demonstration of lost geometric consistency or missing details when the model generates frames from rapidly changing or previously unseen viewpoints on the same scene.
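
One way such a demonstration could be set up is to reproject known 3D anchor points into a generated frame and compare colours at the pixels they land on. The sketch below is only an illustration under stated assumptions: a pinhole camera with intrinsics `K`, a world-to-camera pose `R`, `t`, nearest-pixel sampling, and no occlusion handling. The function names are hypothetical and this is not the paper's evaluation protocol.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world points with a pinhole camera (assumed model).

    Returns pixel coordinates for the points in front of the camera and
    the boolean mask that selects those points.
    """
    cam = points_world @ R.T + t          # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6           # drop points behind the camera
    uvw = cam[in_front] @ K.T             # apply intrinsics
    pixels = uvw[:, :2] / uvw[:, 2:3]     # perspective divide
    return pixels, in_front

def anchor_color_error(frame, points_world, colors, K, R, t):
    """Mean absolute colour error between anchors and the pixels they hit."""
    h, w, _ = frame.shape
    pixels, in_front = project_points(points_world, K, R, t)
    cols = np.rint(pixels[:, 0]).astype(int)
    rows = np.rint(pixels[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    if not inside.any():
        return float("nan")               # nothing visible to compare
    sampled = frame[rows[inside], cols[inside]]
    expected = colors[in_front][inside]
    return float(np.abs(sampled - expected).mean())

# Toy data: one red anchor that projects to pixel (row 4, col 4).
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
anchors = np.array([[0.0, 0.0, 2.0]])
colors = np.array([[1.0, 0.0, 0.0]])

good = np.zeros((8, 8, 3))
good[4, 4] = [1.0, 0.0, 0.0]              # frame agrees with the anchor
bad = np.zeros((8, 8, 3))                 # frame has lost the anchor colour

consistent = anchor_color_error(good, anchors, colors, K, R, t)
drifted = anchor_color_error(bad, anchors, colors, K, R, t)
print(consistent, drifted)
```

A rising error under rapid or previously unseen viewpoints would point to the kind of failure described above; a real test would also need occlusion reasoning and a perceptual metric.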

Figures

Figures reproduced from arXiv: 2603.11911 by the InSpatio Team: Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Yifu Wang, Yipeng Chen, Zhewen Le, Zhichao Ye, Ziqiang Zhao.

**Figure 2.** Overview. In the offline stage, a multi-view-consistent model generates plausible observations that provide 3D anchors and reference appearances. In the online stage, frame model performs fast realtime inference while updating scene content at keyframes.

**Figure 3.** The pipeline of InSpatio-WorldFM. The left part illustrates the conditional novel-view synthesis pipeline of WorldFM. WorldFM takes a reference image $x_{\mathrm{ref}}$ (implicit scene memory), noisy latents $z_t$, and point cloud rendering $\hat{x}_{\mathrm{tgt}}$ (explicit 3D anchor) as inputs, which are spatially concatenated along the width dimension. Reference pose $\pi_{\mathrm{ref}}$ and target pose $\pi_{\mathrm{tgt}}$ are also injected as control signals. The f…

**Figure 4.** Qualitative results of teacher model.

**Figure 5.** Qualitative results of InSpatio-WorldFM.

**Figure 6.** Qualitative results of InSpatio-WorldFM.

**Figure 7.** Qualitative results of InSpatio-WorldFM.

**Figure 8.** Qualitative results of InSpatio-WorldFM. Across observations at varying viewing distances,
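
The input layout described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the reference image, the noisy latents, and the point cloud rendering have already been brought to a common `(C, H, W)` shape, and the concatenation order and the name `build_frame_input` are hypothetical.

```python
import numpy as np

def build_frame_input(x_ref, z_t, x_tgt_hat):
    """Concatenate the three conditioning tensors along the width axis.

    All arrays are assumed to share one (C, H, W) shape; the order
    [reference | noisy latents | anchor rendering] is an assumption.
    """
    shapes = {x_ref.shape, z_t.shape, x_tgt_hat.shape}
    if len(shapes) != 1:
        raise ValueError(f"inputs must share one shape, got {shapes}")
    return np.concatenate([x_ref, z_t, x_tgt_hat], axis=-1)

rng = np.random.default_rng(0)
C, H, W = 4, 32, 32                          # illustrative latent size
x_ref = rng.standard_normal((C, H, W))       # reference view (implicit memory)
z_t = rng.standard_normal((C, H, W))         # noisy latents being denoised
x_tgt_hat = rng.standard_normal((C, H, W))   # point cloud rendering (3D anchor)

model_input = build_frame_input(x_ref, z_t, x_tgt_hat)
print(model_input.shape)  # (4, 32, 96)
```

Placing the three tensors side by side lets a single image backbone attend across reference, target, and anchor regions without a separate conditioning branch; only the middle third corresponds to the frame being generated.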

Original abstract

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
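
The frame-based paradigm in the abstract, together with the keyframe updates mentioned in the Figure 2 caption, can be pictured as a simple control loop. The sketch below is schematic and assumption-laden: `generate_frame` is a stand-in for the real model, `SceneState` is a hypothetical container, and refreshing the scene state every fixed number of frames is an illustrative rule that the page does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneState:
    """Conditioning shared by all frames until the next keyframe (hypothetical)."""
    reference_id: int  # which reference view / anchor set is active

def generate_frame(state, pose):
    """Stand-in for the frame model: output depends only on state and pose."""
    return f"frame(ref={state.reference_id}, pose={pose})"

def explore(poses, keyframe_every=4):
    """Generate one frame per pose and refresh scene state at keyframes.

    The fixed keyframe interval is an assumption made for illustration.
    """
    state = SceneState(reference_id=0)
    frames = []
    for i, pose in enumerate(poses):
        if i > 0 and i % keyframe_every == 0:
            # Keyframe: swap in updated scene content.
            state = SceneState(reference_id=state.reference_id + 1)
        # Each frame is produced independently: no previous frame is passed in.
        frames.append(generate_frame(state, pose))
    return frames

frames = explore(poses=list(range(6)), keyframe_every=4)
for frame in frames:
    print(frame)
```

The point of the sketch is the signature of `generate_frame`: because it never receives earlier outputs, a frame can be shown as soon as it is generated, whereas a window-based video model has to process a whole window first.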

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

InSpatio-WorldFM shifts to independent frame generation with 3D anchors for lower-latency spatial modeling, but the abstract supplies no metrics to back the consistency results.

The main thing to know is that this paper moves away from sequential video generation toward independent frames, using explicit 3D anchors plus implicit spatial memory to hold scene geometry together at interactive speeds on ordinary GPUs. The progressive distillation pipeline from a pretrained image diffusion model down to a controllable frame model and then a few-step generator is the concrete mechanism they describe for cutting latency without window-level processing. That setup is presented as a practical alternative for real-time world simulation.

It does well by releasing the work open-source and targeting consumer hardware, which makes the efficiency claims something the community can actually test rather than take on faith. The training stages are laid out in a straightforward sequence that directly addresses the delay issue in video-based models.

The soft spots sit mostly in the evidence. The abstract states that experiments show strong multi-view consistency, yet it includes no numbers, baselines, error breakdowns, or dataset details. Without those, it is difficult to judge whether the 3D anchors and memory combination really preserves fine details across arbitrary views or if extra mechanisms are still needed. The central assumption looks coherent on the surface but remains untested in the summary provided.

This paper is aimed at computer vision researchers working on spatial world models for robotics or interactive simulation. Anyone looking for lower-latency options to heavy video diffusion setups could extract useful architecture ideas from the pipeline description, though they would need the full methods and results to assess the gains. I would send it for peer review so the quantitative claims and implementation can be checked properly.

Referee Report

2 major / 1 minor

Summary. The manuscript presents InSpatio-WorldFM, an open-source real-time generative frame model for spatial intelligence. Unlike sequential video-based world models, it generates each frame independently using explicit 3D anchors combined with implicit spatial memory to enforce multi-view consistency and preserve global scene geometry across viewpoint changes. A progressive three-stage training pipeline is described that distills a pretrained image diffusion model into a controllable frame model and finally into a few-step real-time generator. The central claim is that this yields strong multi-view consistency while enabling interactive exploration on consumer-grade GPUs as an efficient alternative to window-based video models.

Significance. If the quantitative claims hold, the work would provide a practical advance in real-time spatial simulation by addressing latency bottlenecks in video-based approaches through independent frame generation. The open-source release and targeting of consumer hardware are strengths that could facilitate reproducibility and adoption in interactive applications. The progressive distillation pipeline offers a coherent mechanism for efficiency gains that is directly testable.

major comments (2)

[Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.
[Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.

minor comments (1)

[Abstract] Abstract: Consider including a direct link to the open-source repository to support the reproducibility claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where the comments identify opportunities to strengthen the presentation.

Referee: [Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.

Authors: We agree that the abstract would be strengthened by explicitly referencing supporting quantitative details. The full manuscript reports these metrics, baselines, error analysis, and dataset information in the Experiments section. In the revised version we have updated the abstract to concisely incorporate the key multi-view consistency scores, primary baselines, and dataset references while remaining within length limits. revision: yes
Referee: [Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.

Authors: The manuscript already contains quantitative evaluations in the Experiments section that measure geometry preservation across large viewpoint changes when both components are present. To directly isolate their joint contribution, the revised manuscript adds a targeted ablation study comparing the full model against variants lacking 3D anchors or spatial memory, with results confirming improved consistency under extreme viewpoint shifts. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description outline a frame-based generation paradigm enforced via explicit 3D anchors plus implicit spatial memory, implemented through a progressive three-stage training pipeline (pretrained image diffusion to controllable frame model to few-step distillation). No equations, derivations, or fitted parameters are shown that reduce any claimed prediction or consistency result to the inputs by construction. No self-citations, uniqueness theorems, or ansatz smuggling appear in the text. The central claims rest on an externally described training process rather than self-referential definitions, making the derivation self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of diffusion model adaptability and the sufficiency of 3D anchors for consistency; no new entities are introduced.

free parameters (1)

distillation steps and training hyperparameters
Standard in the three-stage pipeline but not enumerated in the abstract.

axioms (1)

domain assumption: Pretrained image diffusion models can be transformed into controllable frame generators via staged training
Invoked in the progressive three-stage pipeline description.

pith-pipeline@v0.9.0 · 5521 in / 1135 out tokens · 47324 ms · 2026-05-15T12:02:56.565901+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We adopt a three-stage training pipeline that progressively evolves a foundation image generator into an efficient real-time frame model"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
cs.RO · 2026-04 · unverdicted · novelty 7.0

3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV · 2026-04 · unverdicted · novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
cs.CV · 2026-05 · unverdicted · novelty 5.0

ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
