MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
Pith reviewed 2026-06-27 07:19 UTC · model grok-4.3
The pith
MoVerse turns one narrow-view image into a real-time navigable video world at 8 FPS on consumer hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoVerse separates world construction from observation rendering. It expands the narrow input into a gravity-aligned 360° panorama with topology-aware diffusion, then lifts that panorama into a dense 3D Gaussian scaffold via panoramic geometry-aware residual prediction. A Gaussian-conditioned renderer translates scaffold renderings into photorealistic video; it is trained as a bidirectional diffusion teacher and distilled into a causal autoregressive student for bounded-latency streaming.
What carries the argument
Panoramic Gaussian scaffold: the dense, directly renderable 3D spatial memory created from the completed panorama that supplies consistent geometry for subsequent video rendering.
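To make the lifting step concrete, here is a minimal sketch of how pixels of an equirectangular panorama can be unprojected into 3D points that could serve as Gaussian centers. It is not the paper's panoramic geometry-aware residual prediction, which the abstract does not specify. The axis convention (+y up, longitude 0 along +z), the radial-depth input, and the function names `equirect_pixel_to_ray` and `lift_to_points` are all assumptions made for illustration.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map the centre of pixel (u, v) in an equirectangular image to a unit ray.

    Assumed convention (not taken from the paper): +y is up, longitude 0
    looks along +z, and the image covers 360 x 180 degrees.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi  # range [-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi       # top row is positive
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def lift_to_points(depth):
    """Turn a radial depth map (list of rows) into one 3D point per pixel."""
    height = len(depth)
    width = len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = equirect_pixel_to_ray(u, v, width, height)
            d = depth[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points
```

Because every pixel of a closed panorama yields a ray, a lift of this kind covers the full sphere of directions around the camera, which is presumably why the pipeline completes the panorama before any 3D reasoning.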
If this is right
- The explicit 3D scaffold supplies long-range consistency that pure generative video models lack.
- The renderer can follow user-specified camera trajectories while maintaining temporal coherence.
- Distillation from bidirectional teacher to causal student enables 8 FPS streaming on a single consumer GPU.
- The pipeline combines the controllability of explicit 3D representations with the perceptual quality of generative video models.
Where Pith is reading between the lines
- The same separation of scaffold construction from rendering could be applied to short video inputs to initialize richer geometry.
- Adding simple dynamics on the Gaussian scaffold might allow basic object interactions without retraining the renderer.
- Further compression of the student model could support deployment on lower-power devices for mobile scene exploration.
Load-bearing premise
The topology-aware diffusion reliably produces a geometrically consistent 360° panorama without errors that propagate into the later panoramic geometry-aware residual prediction.
What would settle it
Visible geometric drift, seams, or view-inconsistent artifacts appearing in the output video when the camera trajectory enters regions far outside the original narrow field of view.
Original abstract
We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MoVerse, a real-time video world model that constructs an interactively navigable 3D scene from a single narrow-FOV image. It first applies topology-aware diffusion to expand the input into a gravity-aligned 360° panorama, then lifts the panorama into a persistent 3D Gaussian scaffold via panoramic geometry-aware residual prediction. A Gaussian-conditioned video renderer, trained as a bidirectional diffusion teacher and distilled into a causal autoregressive student, translates scaffold renderings along user-specified trajectories into photorealistic video. The paper reports real-time roaming at 8 FPS on a single RTX 4090 GPU.
Significance. If the geometric consistency and real-time claims hold, the work offers a practical path to single-image world modeling that combines explicit 3D representations for controllability and long-range consistency with generative video models for perceptual quality. The separation of world construction from rendering and the distillation for bounded-latency streaming are notable design choices that could influence future interactive 3D generation systems.
major comments (2)
- [Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.
- [Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.
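The depth seam error named in the first major comment is not defined in the abstract or the report. One plausible reading, sketched here under that assumption, measures the depth jump between the left-most and right-most columns of an equirectangular depth map, which are neighbours on the sphere, and compares it with the typical jump between interior neighbours. The function names `depth_seam_error` and `interior_jump` are hypothetical.

```python
def depth_seam_error(depth):
    """Mean absolute depth jump across the 360-degree wrap-around seam.

    `depth` is a list of rows; column 0 and the last column are adjacent
    on the sphere, so a seamless panorama should show only a small jump.
    """
    if not depth or not depth[0]:
        raise ValueError("depth map must be non-empty")
    jumps = [abs(row[0] - row[-1]) for row in depth]
    return sum(jumps) / len(jumps)


def interior_jump(depth):
    """Mean absolute depth jump between horizontally adjacent interior pixels."""
    total = 0.0
    count = 0
    for row in depth:
        for a, b in zip(row, row[1:]):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0
```

Reporting the seam value next to the interior baseline avoids penalising scenes that simply contain many depth discontinuities; a seam jump far above the baseline would point to a wrap-around inconsistency rather than to scene content.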
minor comments (2)
- Notation for 'Panoramic Gaussian Scaffold' and 'panoramic geometry-aware residual prediction' should be defined with explicit equations or pseudocode in the methods section to clarify how residuals are computed and applied.
- The abstract mentions 'gravity-aligned' panorama but does not specify the alignment mechanism or any failure cases when the input image lacks clear gravity cues.
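A gravity alignment error of the kind the report asks for could be expressed as the angle between the panorama's estimated up direction and a reference up direction. The sketch below assumes both are available as 3D vectors and that the reference is +y; the paper's actual alignment mechanism is not described in the abstract, and the function name `gravity_alignment_error_deg` is hypothetical.

```python
import math


def gravity_alignment_error_deg(estimated_up, reference_up=(0.0, 1.0, 0.0)):
    """Angle in degrees between an estimated up vector and a reference up vector."""
    dot = sum(a * b for a, b in zip(estimated_up, reference_up))
    norm_est = math.sqrt(sum(a * a for a in estimated_up))
    norm_ref = math.sqrt(sum(b * b for b in reference_up))
    if norm_est == 0.0 or norm_ref == 0.0:
        raise ValueError("up vectors must be non-zero")
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_est * norm_ref)))
    return math.degrees(math.acos(cosine))
```

An input image without clear gravity cues would show up here as a large or highly variable angle across repeated runs, which is one way to surface the failure cases the comment mentions.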
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger quantitative validation of geometric consistency and detailed performance analysis. We address both major comments below and will incorporate additional metrics and breakdowns in the revised manuscript to better support the claims.
Point-by-point responses
- Referee: [Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.
  Authors: We recognize that explicit quantitative metrics for the geometric consistency of the topology-aware diffusion outputs would provide stronger support for the pipeline's assumptions. While the manuscript validates consistency through downstream visual quality, user studies on roaming, and qualitative panorama/scaffold results, we agree these specific metrics would directly address the concern. In the revised version, we will add depth seam error, gravity alignment error, and scaffold drift measurements over long trajectories, evaluated on a held-out validation set of scenes.
  Revision: yes
- Referee: [Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.
  Authors: We agree that a per-component timing breakdown and distillation ablation are necessary to substantiate the real-time claim and allow assessment of robustness. In the revised manuscript, we will include a table with averaged inference times for panorama diffusion, scaffold construction, and renderer stages on the RTX 4090, along with an ablation comparing the bidirectional teacher and causal student models on both latency and quality metrics. This will clarify the contribution of distillation to achieving bounded-latency performance.
  Revision: yes
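A per-component timing table of the kind promised above could be produced with a small harness like the following sketch. The stage names and callables are placeholders supplied by the caller, not the authors' code, and the helper names `time_stages`, `frame_budget_ms`, and `fits_budget` are hypothetical. The only figure taken from the paper is the 8 FPS target, which corresponds to a budget of 1000 / 8 = 125 ms per frame.

```python
import time


def time_stages(stages, repeats=5):
    """Return mean wall-clock milliseconds per call for each named stage.

    `stages` maps a stage name to a zero-argument callable. For GPU work the
    callable should block until the device has finished (for example by
    calling torch.cuda.synchronize() in PyTorch), otherwise the timer only
    measures kernel launch overhead.
    """
    results = {}
    for name, fn in stages.items():
        fn()  # warm-up call, excluded from the average
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        results[name] = (time.perf_counter() - start) * 1000.0 / repeats
    return results


def frame_budget_ms(fps):
    """Per-frame time budget in milliseconds for a target frame rate."""
    return 1000.0 / fps


def fits_budget(per_frame_ms, fps):
    """True if a measured per-frame cost fits the target frame rate."""
    return per_frame_ms <= frame_budget_ms(fps)
```

A breakdown would presumably need to separate one-off costs (panorama completion and scaffold construction, which build the persistent scene) from the recurring renderer cost, since only the latter has to fit inside the 125 ms budget during roaming.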
Circularity Check
No circularity; pipeline uses external standard components
Full rationale
The abstract and described pipeline separate world construction (topology-aware diffusion for 360° panorama, then panoramic geometry-aware residual prediction for Gaussian scaffold) from rendering (Gaussian-conditioned video model with teacher-student distillation). No equations, fitted parameters renamed as predictions, or self-citations are shown that reduce any load-bearing claim to its own inputs by construction. All steps invoke standard external techniques (diffusion models, 3D Gaussian representations) without self-referential definitions or uniqueness theorems imported from the authors' prior work. The real-time performance claim is presented as an empirical outcome on RTX 4090 hardware rather than a tautological derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Topology-aware diffusion can produce geometrically consistent 360° panoramas from narrow-FOV inputs without breaking downstream 3D lifting.
invented entities (1)
- Panoramic Gaussian Scaffold: no independent evidence