Recognition: 2 theorem links · Lean Theorem
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
Pith reviewed 2026-05-15 12:02 UTC · model grok-4.3
The pith
InSpatio-WorldFM generates each frame independently using explicit 3D anchors and implicit spatial memory to deliver real-time multi-view consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently. It enforces multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, preserving global scene geometry and fine-grained visual details across viewpoint changes. The model is obtained through a progressive three-stage training process that starts from a pretrained image diffusion model and ends with few-step distillation for real-time operation.
What carries the argument
The combination of explicit 3D anchors and implicit spatial memory that enforces multi-view consistency during independent per-frame generation.
Load-bearing premise
That explicit 3D anchors combined with implicit spatial memory are sufficient to preserve global scene geometry and fine-grained details across arbitrary viewpoint changes without additional mechanisms.
What would settle it
Demonstration of lost geometric consistency or missing details when the model generates frames from rapidly changing or previously unseen viewpoints on the same scene.
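One way such a test could be made concrete is a reprojection check between two generated frames of the same scene. The sketch below is not the paper's evaluation protocol. It assumes a pinhole camera, a known depth map for the first view, and a known relative pose, and the function name `reprojection_error` and all of its parameters are hypothetical. It uses nearest-pixel lookup and ignores occlusion, so it is only a rough proxy for geometric consistency.

```python
import numpy as np

def reprojection_error(img_a, img_b, depth_a, K, T_ab):
    """Mean absolute colour difference after warping view A's pixels into view B.

    img_a, img_b: (H, W, 3) float arrays; depth_a: (H, W) depth map of view A;
    K: (3, 3) pinhole intrinsics; T_ab: (4, 4) rigid transform from A's camera
    frame to B's camera frame. All inputs are assumptions of this sketch.
    """
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project every pixel of A to a 3D point in A's camera frame.
    pts_a = (np.linalg.inv(K) @ pix.T) * depth_a.reshape(1, -1)
    # Move the points into B's camera frame.
    pts_b = T_ab[:3, :3] @ pts_a + T_ab[:3, 3:4]
    z = pts_b[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by zero for invalid points
    proj = K @ pts_b
    ub = np.rint(proj[0] / z_safe).astype(int)
    vb = np.rint(proj[1] / z_safe).astype(int)
    # Keep only points that land inside B's image and lie in front of the camera.
    valid = in_front & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    if not valid.any():
        return float("nan")
    colours_a = img_a.reshape(-1, 3)[valid]
    colours_b = img_b[vb[valid], ub[valid]]
    return float(np.abs(colours_a - colours_b).mean())
```

Under these assumptions, an error that grows sharply as the pose moves toward rapidly changing or previously unseen viewpoints would be the kind of evidence described above.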
Original abstract
We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
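The latency argument in the abstract can be read as a simple cost comparison. The symbols below are illustrative assumptions, not quantities reported in the paper: $W$ is the number of frames in a video model's window, $S$ the number of denoising steps it runs per window, $c_{\text{win}}(W)$ the cost of one step over the whole window, $s$ the number of steps left after few-step distillation, and $c_{\text{frame}}$ the cost of one step on a single frame.

```latex
\[
  L_{\text{window}} \approx S \, c_{\text{win}}(W),
  \qquad
  L_{\text{frame}} \approx s \, c_{\text{frame}},
  \qquad
  s \ll S .
\]
```

In this reading, a window-based model cannot show the first frame of a window until the whole window has been denoised, whereas a frame model's response time to a new camera pose depends only on the per-frame cost. Whether the gap is large in practice depends on measured values that this page does not report.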
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents InSpatio-WorldFM, an open-source real-time generative frame model for spatial intelligence. Unlike sequential video-based world models, it generates each frame independently, using explicit 3D anchors combined with implicit spatial memory to enforce multi-view consistency and preserve global scene geometry across viewpoint changes. A progressive three-stage training pipeline is described that transforms a pretrained image diffusion model into a controllable frame model and then, through few-step distillation, into a real-time generator. The central claim is that this yields strong multi-view consistency while enabling interactive exploration on consumer-grade GPUs, as an efficient alternative to window-based video models.
Significance. If the quantitative claims hold, the work would provide a practical advance in real-time spatial simulation by addressing latency bottlenecks in video-based approaches through independent frame generation. The open-source release and targeting of consumer hardware are strengths that could facilitate reproducibility and adoption in interactive applications. The progressive distillation pipeline offers a coherent mechanism for efficiency gains that is directly testable.
major comments (2)
- [Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.
- [Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.
minor comments (1)
- [Abstract] Abstract: Consider including a direct link to the open-source repository to support the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where the comments identify opportunities to strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion of 'experimental results' showing strong multi-view consistency supplies no metrics, baselines, error analysis, or dataset details. This information is load-bearing for the central claim and must be supplied from the results section to allow evaluation of the geometry-preservation performance.
  Authors: We agree that the abstract would be strengthened by explicitly referencing supporting quantitative details. The full manuscript reports these metrics, baselines, error analysis, and dataset information in the Experiments section. In the revised version we have updated the abstract to concisely incorporate the key multi-view consistency scores, primary baselines, and dataset references while remaining within length limits.
  Revision: yes
- Referee: [Training Pipeline] Training pipeline description: The integration of explicit 3D anchors with implicit spatial memory is presented as sufficient to preserve global geometry under independent frame generation, but no ablation or quantitative test is referenced that isolates this combination from potential failure modes under large viewpoint changes.
  Authors: The manuscript already contains quantitative evaluations in the Experiments section that measure geometry preservation across large viewpoint changes when both components are present. To directly isolate their joint contribution, the revised manuscript adds a targeted ablation study comparing the full model against variants lacking 3D anchors or spatial memory, with results confirming improved consistency under extreme viewpoint shifts.
  Revision: yes
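The ablation described in the rebuttal amounts to a small grid of model variants crossed with viewpoint-shift magnitudes. The sketch below only organises such a grid. The variant names, the `evaluate` callback, and the `viewpoint_shifts` argument are hypothetical, and the "neither" variant is an extra control that the rebuttal does not mention. No scores are implied.

```python
from itertools import product

def ablation_variants():
    """Enumerate on/off combinations of the two consistency components."""
    variants = []
    for use_anchors, use_memory in product([True, False], repeat=2):
        if use_anchors and use_memory:
            name = "full"
        elif use_anchors:
            name = "no_memory"
        elif use_memory:
            name = "no_anchors"
        else:
            name = "neither"  # extra control, not mentioned in the rebuttal
        variants.append({
            "name": name,
            "use_3d_anchors": use_anchors,
            "use_spatial_memory": use_memory,
        })
    return variants

def run_ablation(evaluate, viewpoint_shifts):
    """Call a user-supplied evaluate(variant, shift) -> float for every grid cell."""
    return {
        (variant["name"], shift): evaluate(variant, shift)
        for variant in ablation_variants()
        for shift in viewpoint_shifts
    }
```

Here `evaluate` would wrap whatever consistency metric the authors report. Keeping it as a callback avoids assuming a particular metric or dataset.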
Circularity Check
No significant circularity detected
The provided abstract and description outline a frame-based generation paradigm enforced via explicit 3D anchors plus implicit spatial memory, implemented through a progressive three-stage training pipeline (pretrained image diffusion to controllable frame model to few-step distillation). No equations, derivations, or fitted parameters are shown that reduce any claimed prediction or consistency result to the inputs by construction. No self-citations, uniqueness theorems, or ansatz smuggling appear in the text. The central claims rest on an externally described training process rather than self-referential definitions, making the derivation self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- distillation steps and training hyperparameters
axioms (1)
- domain assumption: Pretrained image diffusion models can be transformed into controllable frame generators via staged training
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We adopt a three-stage training pipeline that progressively evolves a foundation image generator into an efficient real-time frame model
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
  3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.
- INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
  INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
  ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.