pith. sign in

arxiv: 2605.25449 · v2 · pith:NLFYDT64new · submitted 2026-05-25 · 💻 cs.CV

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360{deg} Video Diffusion

Pith reviewed 2026-06-29 22:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 360 video generationdigital twins3D-aware diffusiongeometric consistencypanoramic videoscene generationdiffusion models
0
0 comments X

The pith

A 3D cache from sparse 360 inputs acts as a geometric scaffold to produce consistent panoramic videos via diffusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a controllable framework for synthesizing 360-degree videos from limited panoramic inputs to support digital twin creation. Conventional perspective video generators suffer from narrow fields of view that force complex trajectories and produce cross-view drift. Pantheon360 reconstructs an explicit 3D cache from the sparse inputs to guide the diffusion model, letting it handle texture while the cache maintains global geometry for any chosen camera path. This setup yields videos that remain spatially and temporally coherent, directly addressing the needs of simulation and scene reconstruction tasks.

Core claim

Pantheon360 synthesizes high-fidelity 360° videos from sparse 360° inputs by using an explicit 3D Cache reconstructed from the input as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

What carries the argument

Explicit 3D Cache reconstructed from sparse 360° inputs, serving as a geometric scaffold that enforces consistency inside the diffusion process.

If this is right

  • 360° videos can be generated with superior visual quality from sparse panoramic inputs.
  • Geometric coherence holds across arbitrary user-defined camera paths inside the scene.
  • The resulting videos support downstream simulation and digital-twin applications without additional multi-view capture.
  • Trajectory design simplifies because panoramic coverage supplies global context from the start.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fewer input views may suffice for building usable digital twins than current multi-camera pipelines require.
  • The cache-guided approach could integrate with existing robotics simulators to test navigation in reconstructed environments.
  • Extending the cache to handle time-varying elements might enable consistent video of moving scenes.

Load-bearing premise

The 3D Cache reconstructed from sparse 360° inputs can reliably serve as a geometric scaffold that enforces global consistency for arbitrary user-defined camera paths inside the diffusion process.

What would settle it

Generating videos along novel camera trajectories and then measuring visible geometric distortions or inconsistencies absent from the original input views would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.25449 by Cho-Ying Wu, David Paz, Fangzhou Lin, Hengyuan Zhang, Jie-Ying Lee, Liu Ren, Tao Tu, Ting-Hsuan Chen, Xinyu Huang, Ying-Huan Chen, Yue Wang, Yuliang Guo, Yu-Lun Liu.

Figure 1
Figure 1. Figure 1: Pantheon360: Controllable 360° Video Generation. Given sparse or single 360° input images, Pantheon360 generates temporally consistent 360° videos along user-defined camera trajectories with precise geometric control. Top: From sparse views or a single view, our method synthesizes smooth videos following diverse camera trajectories across varied scenes, demonstrating flexible trajectory control from minima… view at source ↗
Figure 2
Figure 2. Figure 2: Motivation for Using 360° Images for Generation. Left: When traversing to the back of the room, 360° anchor frames provide complete scene context, enabling accurate generation of occluded regions. In contrast, perspective anchor frames have a limited field-of-view and must hallucinate unseen areas, leading to significant artifacts. Right: Generating 360° outputs in a single pass ensures global coherence an… view at source ↗
Figure 3
Figure 3. Figure 3: Pantheon360 Pipeline. Given sparse 360° input frames, we first crop them into perspective views and reconstruct a 3D point cloud cache using foundation models (e.g., π 3 [73], VGGT [67]). We then render this cache along the target camera trajectory to produce a geometry-only equirectangular video Vgeo, which is encoded into latent space and concatenated with noised latents for geometric conditioning. Simul… view at source ↗
Figure 4
Figure 4. Figure 4: Visualization Results on the Web360 [62] and Habitat [46] datasets. Our method (Ours) generates temporally consistent videos with coherent cross-view geometry across diverse camera trajectories. In contrast, perspective-based baselines (ViewCrafter [87], TrajectoryCrafter [86], GEN3C [52]) exhibit severe cross-view inconsistencies when rendered from different viewing angles (Left vs. Right), revealing thei… view at source ↗
Figure 5
Figure 5. Figure 5: Application. Video Synthesis from Google Street View. Our method generates consistent 360° videos from sparse Google Street View imagery, enabling smooth navigation across extended trajectories [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison to GenEX [43]. Our method maintains consistent quality throughout the trajectory while GenEX’s quality degrades rapidly with increasing geometric inconsistencies. Results. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison to PanoSplatt3R [51]. Our method pro￾duces geometrically accurate interpolations with clean structure, while PanoSplatt3R exhibits visible artifacts and geometric distor￾tions. are provided and we generate novel views between them. This task is particularly relevant for synthesizing continuous videos from sparse Google Maps Street View panoramas. Results. As shown in [PITH_FULL_IMAGE:figures/fu… view at source ↗
Figure 8
Figure 8. Figure 8: Novel View Synthesis on Google Maps Street View. Our method produces geometrically accurate renderings across different viewing angles with consistent structures. GEN3C [52] suffers from ghosting artifacts, geometric distortions, and inter-view inconsistencies [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: 3D Point Cloud Reconstruction Quality. We reconstruct 3D point clouds from generated videos using π 3 [73]. Our method yields dense, structurally coherent reconstructions (right), while GEN3C [52] produces sparse, fragmented results (left), demonstrat￾ing our superior 3D consistency. 5. Conclusion We present Pantheon360, a framework for controllable 360° video generation with precise camera trajectory cont… view at source ↗
Figure 10
Figure 10. Figure 10: , the point cloud reconstructed from the reference image (Before) contains only the visible geometry from the input viewpoint. In contrast, the point cloud reconstructed from our generated video (After) is significantly more com￾plete, successfully hallucinating previously occluded regions while maintaining consistency with the original scene struc￾ture. This demonstrates that our method not only preserve… view at source ↗
Figure 11
Figure 11. Figure 11: Robustness to 3D Cache imperfections. Our video diffusion prior can inpaint and refine regions where the 3D Cache contains holes or inaccuracies caused by dynamic objects or low￾light conditions. with stitching artifacts that propagate through the pipeline. Addressing these limitations requires future advances in 4D reconstruction or higher-quality input capture. Input Image Generated Video Complex Motion… view at source ↗
Figure 12
Figure 12. Figure 12: Failure cases. Our method struggles with (1) crowded dynamic scenes with many moving objects, and (2) input 360° images with stitching artifacts [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Closed-loop trajectory validation. Our method gener￾ates consistent video along a closed-loop trajectory, successfully revisiting the starting region without temporal inconsistencies [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
read the original abstract

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360{\deg} video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360{\deg} Video Diffusion, a controllable 360{\deg} video generation framework that synthesizes high-fidelity videos from sparse 360{\deg} inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360{\deg} scene generation for downstream simulation and digital-twin applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce Pantheon360, a controllable 3D-aware 360° video diffusion framework that reconstructs an explicit 3D Cache from sparse 360° inputs to act as a geometric scaffold; this allows the diffusion model to synthesize photorealistic textures while the cache enforces global geometric consistency for arbitrary user-defined camera paths, yielding superior visual quality and unmatched coherence for digital-twin applications.

Significance. If the 3D Cache mechanism is shown to reliably enforce consistency for paths distant from the input views, the work would be significant for cs.CV by offering a practical separation of geometry enforcement from texture synthesis in panoramic generation, addressing FoV limitations of perspective models and supporting downstream simulation tasks.

major comments (2)
  1. [Abstract] Abstract: the performance claims of 'superior visual quality and unmatched geometric coherence' and 'reliable and flexible 360° scene generation' are stated without any metrics, baselines, quantitative results, or experimental details, which is load-bearing for assessing whether the central contribution holds.
  2. [Method] Method (3D Cache description): the claim that the 3D Cache 'enforces global geometric consistency' for arbitrary user-defined camera paths is central; the manuscript must specify the exact conditioning mechanism inside the diffusion process (hard constraint vs. soft guidance) and provide failure-case analysis for reconstruction errors, holes, or depth inaccuracies common with sparse panoramic inputs.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'taming digital twin generation' is informal and could be replaced with a more precise description of the technical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will make the requested clarifications and additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the performance claims of 'superior visual quality and unmatched geometric coherence' and 'reliable and flexible 360° scene generation' are stated without any metrics, baselines, quantitative results, or experimental details, which is load-bearing for assessing whether the central contribution holds.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revision we will add a concise sentence reporting the key metrics (e.g., PSNR/SSIM gains and geometric consistency scores versus the strongest baselines) while remaining within the abstract word limit. Full tables, baselines, and experimental protocols already appear in the Experiments section and will be cross-referenced. revision: yes

  2. Referee: [Method] Method (3D Cache description): the claim that the 3D Cache 'enforces global geometric consistency' for arbitrary user-defined camera paths is central; the manuscript must specify the exact conditioning mechanism inside the diffusion process (hard constraint vs. soft guidance) and provide failure-case analysis for reconstruction errors, holes, or depth inaccuracies common with sparse panoramic inputs.

    Authors: We will expand the Method section to state that the 3D Cache is injected as soft guidance: multi-scale depth and feature maps from the cache are concatenated with the noisy latent and fed into the U-Net via cross-attention and skip connections at every denoising step. We will also add a new subsection with qualitative failure-case examples (reconstruction holes, depth errors on distant surfaces) together with the corresponding generated frames, showing both the limitations and how the diffusion model mitigates them. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core construction reconstructs an explicit 3D Cache directly from the sparse 360° input videos and uses it as conditioning for the diffusion process. This separation treats the cache as an independent geometric input rather than deriving it from the model's own outputs or predictions. No equations, self-citations, or uniqueness claims are shown that reduce the claimed geometric consistency to a fitted parameter, self-definition, or prior author work. The derivation chain is therefore self-contained against external reconstruction inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim depends on the 3D Cache functioning as an effective external geometric constraint; this component is introduced without independent evidence or derivation details in the abstract.

invented entities (1)
  • 3D Cache no independent evidence
    purpose: Geometric scaffold reconstructed from input to guide diffusion along arbitrary paths
    Presented as the key mechanism that enforces consistency while the diffusion model handles texture.

pith-pipeline@v0.9.1-grok · 5781 in / 974 out tokens · 26921 ms · 2026-06-29T22:44:34.803817+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

94 extracted references · 24 canonical work pages · 8 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

  2. [2]

    Di- verse plausible 360-degree image outpainting for efficient 3dcg background creation

    Naofumi Akimoto, Yuhi Matsuo, and Yoshimitsu Aoki. Di- verse plausible 360-degree image outpainting for efficient 3dcg background creation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11441–11450, 2022. 3

  3. [3]

    Met3r: Measuring multi- view consistency in generated images

    Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi- view consistency in generated images. InComputer Vision and Pattern Recognition (CVPR), 2024. 5

  4. [4]

    Lindell, and Sergey Tulyakov

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Ali- aksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers,

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 2, 3, 13

  6. [6]

    Video generation models as world simula- tors.OpenAI Technical Report, 2024

    Tim Brooks et al. Video generation models as world simula- tors.OpenAI Technical Report, 2024. 2

  7. [7]

    Flexworld: Progressively expanding 3d scenes for flexiable- view synthesis.arXiv preprint arXiv:2503.13265, 2025

    Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable- view synthesis.arXiv preprint arXiv:2503.13265, 2025. 2

  8. [8]

    Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Ad- vances in Neural Information Processing Systems, 37:107064– 107086, 2024

    Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Ad- vances in Neural Information Processing Systems, 37:107064– 107086, 2024. 2

  9. [9]

    Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022

    Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022. 3

  10. [10]

    Panogrf: Generalizable spherical radiance fields for wide-baseline panoramas

    Zheng Chen, Yan-Pei Cao, Yuan-Chen Guo, Chen Wang, Ying Shan, and Song-Hai Zhang. Panogrf: Generalizable spherical radiance fields for wide-baseline panoramas. InAdvances in Neural Information Processing Systems (NeurIPS), pages 6961–6985, 2023. 3

  11. [11]

    Splatter-360: Generalizable 360 gaussian splatting for wide- baseline panoramic images

    Zheng Chen, Chenming Wu, Zhelun Shen, Chen Zhao, We- icai Ye, Haocheng Feng, Errui Ding, and Song-Hai Zhang. Splatter-360: Generalizable 360 gaussian splatting for wide- baseline panoramic images. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21590– 21599, 2025. 3

  12. [12]

    Splannequin: Freezing monocular mannequin-challenge footage with dual-detection splatting

    Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, and Yu-Lun Liu. Splannequin: Freezing monocular mannequin-challenge footage with dual-detection splatting. arXiv preprint arXiv:2512.05113, 2025. 2

  13. [13]

    Spherenet: Learning spherical representations for detection and classification in omnidirectional images

    Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In European Conference on Computer Vision (ECCV), pages 518–533. Springer, 2018. 3

  14. [14]

    Human-compatible driving agents through data-regularized self-play reinforce- ment learning.Reinforcement Learning Journal, 1, 2024

    Daphne Cornelisse and Eugene Vinitsky. Human-compatible driving agents through data-regularized self-play reinforce- ment learning.Reinforcement Learning Journal, 1, 2024. 2

  15. [15]

    Leveraging large-scale pretrained vision foundation models for label-efficient 3d point cloud segmen- tation.arXiv preprint arXiv:2311.01989, 2023

    Shichao Dong et al. Leveraging large-scale pretrained vision foundation models for label-efficient 3d point cloud segmen- tation.arXiv preprint arXiv:2311.01989, 2023. 2

  16. [16]

    Digital twin generation from visual data: A survey.arXiv preprint arXiv:2504.13159, 2025

    Shichao Dong et al. Digital twin generation from visual data: A survey.arXiv preprint arXiv:2504.13159, 2025. 2

  17. [17]

    Pano popups: Indoor 3d reconstruction with a plane-aware network

    Marc Eder, Pierre Moulon, and Li Guan. Pano popups: Indoor 3d reconstruction with a plane-aware network. InInterna- tional Conference on 3D Vision (3DV), pages 76–84. IEEE,

  18. [18]

    Spec- tromotion: Dynamic 3d reconstruction of specular scenes

    Cheng-De Fan, Chen-Wei Chang, Yi-Ruei Liu, Jie-Ying Lee, Jiun-Long Huang, Yu-Chee Tseng, and Yu-Lun Liu. Spec- tromotion: Dynamic 3d reconstruction of specular scenes. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21328–21338, 2025. 2, 3

  19. [19]

    Explo- rative inbetweening of time and space

    Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Vic- toria Abrevaya, Michael J Black, and Xuaner Zhang. Explo- rative inbetweening of time and space. InEuropean Confer- ence on Computer Vision, pages 378–395. Springer, 2024. 4, 5, 13, 14

  20. [20]

    Diffusion360: Seamless 360 degree panoramic im- age generation based on diffusion models.arXiv preprint arXiv:2311.13141, 2023

    Mengyang Feng, Jinlin Liu, Miaomiao Cui, and Xuansong Xie. Diffusion360: Seamless 360 degree panoramic im- age generation based on diffusion models.arXiv preprint arXiv:2311.13141, 2023. 3

  21. [21]

    Flowr: Flowing from sparse to dense 3d reconstructions

    Tobias Fischer, Samuel Rota Bul`o, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman M ¨uller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, and Peter Kontschieder. Flowr: Flowing from sparse to dense 3d reconstructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27702–27712, 2025. 2

  22. [22]

    3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation, 2025

    Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation, 2025. 2

  23. [24]

    Cameractrl: Enabling camera control for text-to-video generation, 2025

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation, 2025. 2

  24. [25]

    Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

    Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025. 2

  25. [26]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He et al. Cameractrl: Enabling camera control for text- to-video generation.arXiv preprint arXiv:2404.02101, 2024. 2

  26. [27]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis, 2024

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sar- gent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl V ondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis, 2024. 2

  27. [28]

    Training-free camera control for video generation, 2025

    Chen Hou and Zhibo Chen. Training-free camera control for video generation, 2025. 2

  28. [29]

    Vivid4d: Improving 4d reconstruction from monocular video by video inpainting

    Jiaxin Huang, Sheng Miao, Bangbang Yang, Yuewen Ma, and Yiyi Liao. Vivid4d: Improving 4d reconstruction from monocular video by video inpainting. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12592–12604, 2025. 2

  29. [30]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Ko- rovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. Vipe: Video pose engine for 3d geometric perception. InNVIDIA Research Whitepapers arXiv:2508.10934, 2025. 5, 14

  30. [31]

    CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion

    Chenhao Ji, Zidong Tan, Yiyang Shen, and Fuchen Tan. Cam- pvg: Camera-controlled panoramic video generation with epipolar-aware diffusion.arXiv preprint arXiv:2509.19979,

  31. [32]

    Bench2drive: Towards multi-ability bench- marking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability bench- marking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024. 2

  32. [33]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. InProceedings of The 2nd Conference on Robot Learning, pages 651–673. PMLR, 2018. 2

  33. [34]

    Cubediff: Repurposing diffusion-based image models for panorama generation

    Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, and Federico Tombari. Cubediff: Repurposing diffusion-based image models for panorama generation. InThe Thirteenth International Con- ference on Learning Representations, 2025. 3

  34. [35]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 2

  35. [36]

    Skyfall-gs: Synthesizing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025

    Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthesizing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025. 3

  36. [37]

    Ground- ing image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Con- ference on Computer Vision (ECCV), pages 71–91. Springer,

  37. [38]

    Plataniotis, Sergey Tulyakov, and Jian Ren

    Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image, 2025. 2

  38. [39]

    Cylin- painting: Seamless 360° panoramic image outpainting and beyond.IEEE Transactions on Image Processing, 33:290– 305, 2023

    Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin, Jing Zhang, Yao Zhao, Moncef Gabbouj, and Dacheng Tao. Cylin- painting: Seamless 360° panoramic image outpainting and beyond.IEEE Transactions on Image Processing, 33:290– 305, 2023. 3

  39. [40]

    Longsplat: Robust unposed 3d gaussian splatting for casual long videos

    Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, and Yu-Lun Liu. Longsplat: Robust unposed 3d gaussian splatting for casual long videos. InICCV, 2025. 2

  40. [41]

    Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Re- conx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767, 2024. 2

  41. [42]

    Robust dynamic radiance fields

    Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023. 2

  42. [43]

    Genex: Generating an explorable world

    Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chel- lappa, Alan Yuille, and Jieneng Chen. Genex: Generating an explorable world. In13th International Conference on Learning Representations (ICLR), 2025. 2, 3, 7

  43. [44]

    You see it, you got it: Learning 3d creation on pose-free videos at scale

    Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, and Xinlong Wang. You see it, you got it: Learning 3d creation on pose-free videos at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2016–2029, 2025. 2

  44. [45]

    Wan-Duo Kurt Ma, J. P. Lewis, and W. Bastiaan Kleijn. Trail- blazer: Trajectory control for diffusion-based video genera- tion.arXiv preprint arXiv:2401.00896, 2024. 2

  45. [46]

    Habitat: A Platform for Embodied AI Research

    Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 5, 6, 7, 8

  46. [47]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

  47. [48]

    Efficient synthetic defect on 3d object reconstruction and generation pipeline for digital twins smart factory.Sensors, 25(22):6908, 2024

    Rafay Mohiuddin et al. Efficient synthetic defect on 3d object reconstruction and generation pipeline for digital twins smart factory.Sensors, 25(22):6908, 2024. 2

  48. [49]

    Bips: Bi-modal indoor panorama synthesis via residual depth-aided adversarial learn- ing

    Changgyoon Oh, Wonjune Cho, Yujeong Chae, Daehee Park, Lin Wang, and Kuk-Jin Yoon. Bips: Bi-modal indoor panorama synthesis via residual depth-aided adversarial learn- ing. InEuropean Conference on Computer Vision, pages 352–371. Springer, 2022. 3

  49. [50]

    Greire Payen de La Garanderie, Amir Atapour Abarghouei, and Toby P. Breckon. Eliminating the blind spot: Adapting 3d object detection and monocular depth estimation to 360 panoramic imagery. InEuropean Conference on Computer Vision (ECCV), pages 789–807. Springer, 2018. 3

  50. [51]

    Panos- platt3r: Leveraging perspective pretraining for generalized unposed wide-baseline panorama reconstruction

    Jiahui Ren, Mochu Xiang, Jiajun Zhu, and Yuchao Dai. Panos- platt3r: Leveraging perspective pretraining for generalized unposed wide-baseline panorama reconstruction. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 28959–28969, 2025. 3, 7

  51. [52]

    Gen3c: 3d-informed world- consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas M¨uller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 2, 5, 6, 8

  52. [53]

    Videoartgs: Building digital twins of articulated objects from monocular video.arXiv preprint arXiv:2509.17647, 2025

    Alexander Rusnak et al. Videoartgs: Building digital twins of articulated objects from monocular video.arXiv preprint arXiv:2509.17647, 2025. 2

  53. [54]

    Prior-enhanced gaussian splatting for dynamic scene reconstruction from casual video.arXiv preprint arXiv:2512.11356, 2025

    Meng-Li Shih, Ying-Huan Chen, Yu-Lun Liu, and Brian Curless. Prior-enhanced gaussian splatting for dynamic scene reconstruction from casual video.arXiv preprint arXiv:2512.11356, 2025. 2

  54. [55]

    Imagine360: Immersive 360 video generation from perspective anchor.arXiv preprint arXiv:2412.03552, 2024

    Jing Tan, Shuai Yang, Tong Wu, Jingwen He, Yuwei Guo, Ziwei Liu, and Dahua Lin. Imagine360: Immersive 360 video generation from perspective anchor.arXiv preprint arXiv:2412.03552, 2024. 2, 3

  55. [56]

    Beyond the Frame: Generating 360 Panoramic Videos from Perspective Videos

    Zidong Tan, Chenhao Ji, Fuchen Tan, and Yiyang Shen. Be- yond the frame: Generating 360° panoramic videos from perspective videos.arXiv preprint arXiv:2504.07940, 2025. 2, 3, 13

  56. [57]

    Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion,

    Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion,

  57. [58]

    Distortion-aware convolutional filters for dense prediction in panoramic images

    Keisuke Tateno, Nassir Navab, and Federico Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. InEuropean Conference on Computer Vision (ECCV), pages 707–722. Springer, 2018. 3

  58. [59]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on com- puter vision, pages 402–419. Springer, 2020. 14

  59. [60]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021. 5

  60. [61]

    Foundational models for 3d point clouds: A survey and outlook.arXiv preprint arXiv:2501.18594, 2025

    Vishal Thengane et al. Foundational models for 3d point clouds: A survey and outlook.arXiv preprint arXiv:2501.18594, 2025. 2

  61. [62]

    From an image to a scene: Learning to imagine the world from a million 360 videos.Advances in Neural Information Process- ing Systems, 37:17743–17760, 2024

    Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, and Ali Farhadi. From an image to a scene: Learning to imagine the world from a million 360 videos.Advances in Neural Information Process- ing Systems, 37:17743–17760, 2024. 4, 6, 14

  62. [63]

    Stylelight: Hdr panorama generation for lighting esti- mation and editing

    Guangcong Wang, Yinuo Yang, Chen Change Loy, and Ziwei Liu. Stylelight: Hdr panorama generation for lighting esti- mation and editing. InEuropean Conference on Computer Vision (ECCV), pages 477–492. Springer, 2022. 3

  63. [64]

    Customizing 360-degree panoramas through text-to-image diffusion models

    Hai Wang, Xiaoyu Xiang, Yuchen Fan, and Jing-Hao Xue. Customizing 360-degree panoramas through text-to-image diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 4933–4943, 2024. 3

  64. [65]

    Videoscene: Distilling video diffusion model to generate 3d scenes in one step

    Hanyang Wang, Fangfu Liu, Jiawei Chi, and Yueqi Duan. Videoscene: Distilling video diffusion model to generate 3d scenes in one step. In2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 16475– 16485. IEEE, 2025. 2

  65. [66]

    Vistadream: Sampling multiview consistent images for single-view scene reconstruction

    Haiping Wang, Yuan Liu, Ziwei Liu, Wenping Wang, Zhen Dong, and Bisheng Yang. Vistadream: Sampling multiview consistent images for single-view scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26772–26782, 2025. 2

  66. [67]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3, 4, 13

  67. [68]

    Ning-Hsu Albert Wang and Yu-Lun Liu. Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation.Advances in Neural Information Processing Systems, 37:127739–127764,

  68. [69]

    360dvd: Controllable panorama video generation with 360-degree video diffusion model

    Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, and Jian Zhang. 360dvd: Controllable panorama video generation with 360-degree video diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6913–6923, 2024. 3, 5

  69. [70]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510– 10522, 2025. 3

  70. [71]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024. 2, 3, 13

  71. [72]

    Cpa: Camera-pose-awareness diffusion transformer for video generation, 2024

    Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, and Bo Li. Cpa: Camera-pose-awareness diffusion transformer for video generation, 2024. 2

  72. [73]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347,

  73. [74]

    2, 3, 4, 5, 8, 13, 15

  74. [75]

    Motionctrl: A uni- fied and flexible motion controller for video generation, 2024

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A uni- fied and flexible motion controller for video generation, 2024. 2

  75. [76]

    Motionctrl: A unified and flexible mo- tion controller for video generation.ACM Transactions on Graphics, 2024

    Zhouxia Wang et al. Motionctrl: A unified and flexible mo- tion controller for video generation.ACM Transactions on Graphics, 2024. 2

  76. [77]

    Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J. Fleet. Controlling space and time with diffusion models, 2025. 2

  77. [78]

    Aurafu- sion360: Augmented unseen region alignment for reference- based 360deg unbounded scene inpainting

    Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, et al. Aurafu- sion360: Augmented unseen region alignment for reference- based 360deg unbounded scene inpainting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16366–16376, 2025. 3

  78. [79]

    Cat4d: Create anything in 4d with multi-view video diffusion models

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 26057–26068, 2025. 2

  79. [80]

    Genfusion: Closing the loop between recon- struction and generation via videos

    Sibo Wu, Congrong Xu, Binbin Huang, Andreas Geiger, and Anpei Chen. Genfusion: Closing the loop between recon- struction and generation via videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6078–6088, 2025. 2

  80. [81]

    Panodiffu- sion: 360-degree panorama outpainting via diffusion.arXiv preprint arXiv:2307.03177, 2023

    Tianhao Wu, Chuanxia Zheng, and Tat-Jen Cham. Panodiffu- sion: 360-degree panorama outpainting via diffusion.arXiv preprint arXiv:2307.03177, 2023. 3

Showing first 80 references.