Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
Rein3D reconstructs photorealistic and globally consistent 3D indoor scenes from sparse inputs by restoring imperfect panoramic videos with diffusion models to refine 3D Gaussian Splatting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rein3D follows a restore-and-refine paradigm that couples explicit 3D Gaussian Splatting with temporally coherent priors from video diffusion models. A radial exploration strategy renders imperfect panoramic videos from the origin to uncover occluded regions. These sequences are restored by a panoramic video-to-video diffusion model and enhanced via video super-resolution. The refined videos then serve as pseudo-ground truths to update the global 3D Gaussian field.
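The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function `restore_and_refine` and the callables `render`, `restore`, `upscale`, and `fit` are hypothetical stand-ins for the 3DGS renderer, the panoramic video-to-video diffusion model, the video super-resolution step, and the Gaussian-field update.

```python
from typing import Callable, List, Sequence, TypeVar

Scene = TypeVar("Scene")            # stands in for a 3D Gaussian field
Video = TypeVar("Video")            # stands in for a panoramic frame sequence
Trajectory = TypeVar("Trajectory")  # stands in for a camera path


def restore_and_refine(
    scene: Scene,
    trajectories: Sequence[Trajectory],
    render: Callable[[Scene, Trajectory], Video],
    restore: Callable[[Video], Video],
    upscale: Callable[[Video], Video],
    fit: Callable[[Scene, Sequence[Trajectory], Sequence[Video]], Scene],
) -> Scene:
    """One pass of the loop: render, restore, super-resolve, then refit.

    All four callables are hypothetical placeholders supplied by the caller.
    """
    pseudo_gt: List[Video] = []
    for traj in trajectories:
        degraded = render(scene, traj)        # imperfect video from the coarse field
        restored = restore(degraded)          # video-to-video restoration
        pseudo_gt.append(upscale(restored))   # video super-resolution
    # The refined videos act as pseudo-ground truths for the global update.
    return fit(scene, trajectories, pseudo_gt)
```

Passing the stages in as callables keeps the sketch independent of any particular renderer or diffusion backbone, which the summary above does not specify.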
What carries the argument
The restore-and-refine paradigm that renders imperfect panoramic videos from an initial 3D Gaussian Splatting model, restores them with a video-to-video diffusion model, and uses the outputs as pseudo-ground truths to refine the global 3D field.
Load-bearing premise
The panoramic video-to-video diffusion model can reliably restore massive missing geometry and textures in occluded regions to produce pseudo-ground truths that improve rather than degrade the global 3D Gaussian field.
What would settle it
A controlled test that applies the full restore-and-refine loop to a scene with known ground-truth geometry and shows higher reconstruction error or new visual inconsistencies in previously occluded regions after the update step.
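One way such a test could be scored is sketched below, assuming flattened per-pixel values and a boolean mask marking the previously occluded region. The helper names `masked_mse` and `refinement_degrades` are illustrative and do not come from the paper.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[float], target: Sequence[float], mask: Sequence[bool]
) -> float:
    """Mean squared error over the pixels where mask is True."""
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    picked = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not picked:
        raise ValueError("mask selects no pixels")
    return sum(picked) / len(picked)


def refinement_degrades(
    before: Sequence[float],
    after: Sequence[float],
    ground_truth: Sequence[float],
    occluded_mask: Sequence[bool],
) -> bool:
    """True if error inside the occluded region went up after the update step."""
    err_before = masked_mse(before, ground_truth, occluded_mask)
    err_after = masked_mse(after, ground_truth, occluded_mask)
    return err_after > err_before
```

Restricting the comparison to the occluded mask matters here: a global average could hide regressions in exactly the regions the diffusion model had to invent.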
Original abstract
The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Rein3D, a restore-and-refine framework that initializes a coarse 3D Gaussian Splatting (3DGS) field from sparse inputs, renders panoramic videos along radial trajectories to expose occluded regions, restores these videos using a panoramic video-to-video diffusion model trained on the newly constructed PanoV2V-15K dataset, applies video super-resolution, and feeds the restored sequences back as pseudo-ground truths to refine the global 3DGS field for photorealistic, globally consistent 3D indoor scenes.
Significance. If the central claim holds, the work offers a practical way to leverage pre-trained video diffusion priors for large-scale inpainting in 3D reconstruction, which could benefit Embodied AI and VR applications by improving long-range consistency beyond what pure 3DGS or NeRF methods achieve from sparse views.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the claim that 'experiments demonstrate... significantly improves long-range camera exploration' is unsupported by any reported quantitative metrics, error bars, baseline details, ablation studies, or per-region reconstruction errors against held-out geometry; without these, it is impossible to verify whether the diffusion-restored pseudo-ground truths improve rather than degrade the 3DGS field.
- [§3.2] §3.2 (Restore-and-refine loop): the pipeline assumes that the panoramic video-to-video diffusion model (trained on PanoV2V-15K) reliably restores massive missing geometry and textures in occluded regions without introducing geometrically inconsistent hallucinations; no direct evidence (e.g., PSNR/SSIM on held-out real panoramas or geometric consistency metrics) is provided to confirm this assumption, which is load-bearing for the global consistency claim.
minor comments (2)
- [§3.1] The construction details and statistics of the PanoV2V-15K dataset (e.g., how degraded/clean pairs were generated, diversity of scenes) should be expanded for reproducibility.
- [§3.3] Notation for the radial exploration trajectories and the exact loss terms used when updating the 3DGS field from restored videos could be clarified with an equation.
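To make the request about trajectory notation concrete, a radial exploration path could be parameterized as in the sketch below. Everything here is an assumption for illustration: the function name `radial_trajectories`, the evenly spaced azimuths, the fixed step count, and the restriction to the horizontal plane are not taken from the paper, which may define its trajectories differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def radial_trajectories(
    num_directions: int, steps: int, max_radius: float
) -> List[List[Point]]:
    """Camera centers moving outward from the origin along evenly spaced azimuths.

    Assumption: motion stays in the horizontal plane (y = 0) and each path
    has steps + 1 positions, starting exactly at the origin.
    """
    if num_directions < 1 or steps < 1:
        raise ValueError("num_directions and steps must be positive")
    paths: List[List[Point]] = []
    for k in range(num_directions):
        azimuth = 2.0 * math.pi * k / num_directions
        path: List[Point] = []
        for i in range(steps + 1):
            r = max_radius * i / steps  # linear spacing from 0 to max_radius
            path.append((r * math.cos(azimuth), 0.0, r * math.sin(azimuth)))
        paths.append(path)
    return paths
```

A panoramic camera needs no per-frame orientation to cover the full sphere, so the sketch records positions only.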
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that 'experiments demonstrate... significantly improves long-range camera exploration' is unsupported by any reported quantitative metrics, error bars, baseline details, ablation studies, or per-region reconstruction errors against held-out geometry; without these, it is impossible to verify whether the diffusion-restored pseudo-ground truths improve rather than degrade the 3DGS field.
Authors: We agree that the long-range exploration claim requires stronger quantitative backing. The current experiments emphasize qualitative visual results and consistency in rendered trajectories, but lack explicit numerical metrics, error bars, and per-region analysis for occluded areas. In the revised manuscript we will add PSNR, SSIM, and LPIPS scores on held-out long-range views, baseline comparisons with error bars from repeated runs, and ablation studies isolating the contribution of the restore-and-refine loop. We will also report per-region reconstruction errors to confirm that the pseudo-ground truths improve rather than degrade the 3DGS field.
Revision: yes
-
Referee: [§3.2] §3.2 (Restore-and-refine loop): the pipeline assumes that the panoramic video-to-video diffusion model (trained on PanoV2V-15K) reliably restores massive missing geometry and textures in occluded regions without introducing geometrically inconsistent hallucinations; no direct evidence (e.g., PSNR/SSIM on held-out real panoramas or geometric consistency metrics) is provided to confirm this assumption, which is load-bearing for the global consistency claim.
Authors: The referee is correct that direct validation of the diffusion restoration step is currently missing. While end-to-end 3D consistency provides indirect support, we will add explicit metrics in the revision: PSNR and SSIM on held-out pairs from PanoV2V-15K, plus geometric consistency measures such as depth-map error and normal consistency across restored frames. These additions will demonstrate that the model does not introduce hallucinations that undermine global consistency.
Revision: yes
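Some of the metrics named in the responses above have simple closed forms, sketched here in plain Python over flattened values. PSNR and absolute-relative depth error follow their standard definitions. The `normal_consistency` function is only one plausible reading of "normal consistency" (mean cosine between corresponding normals) and is an assumption. SSIM and LPIPS are omitted because they need windowed statistics and a learned network respectively.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equally long")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def abs_rel_depth_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of |pred - target| / target over pixels with valid (positive) target depth."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0.0]
    if not pairs:
        raise ValueError("no valid target depths")
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def normal_consistency(a: Sequence[Vec3], b: Sequence[Vec3]) -> float:
    """Mean cosine between corresponding normals (assumed definition)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equally long")

    def unit(v: Vec3) -> Vec3:
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            raise ValueError("zero-length normal")
        return (v[0] / norm, v[1] / norm, v[2] / norm)

    total = 0.0
    for u, v in zip(a, b):
        un, vn = unit(u), unit(v)
        total += sum(x * y for x, y in zip(un, vn))
    return total / len(a)
```

A value of 1.0 from `normal_consistency` means the two normal maps agree everywhere, while values near 0 indicate largely perpendicular normals.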
Circularity Check
No significant circularity in derivation chain
The paper describes a procedural 'restore-and-refine' pipeline that couples 3D Gaussian Splatting with a panoramic video-to-video diffusion model trained on the newly constructed PanoV2V-15K dataset. No equations, derivations, or first-principles results are presented that reduce any prediction or output to fitted parameters or self-referential definitions by construction. The method relies on external pre-trained diffusion models and an independently constructed paired dataset for restoration, with the central claims of photorealism and consistency arising from the iterative application of these components rather than from any self-definitional or load-bearing self-citation loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A pre-trained panoramic video-to-video diffusion model can accurately infer and restore large occluded regions in indoor scenes without introducing global inconsistencies.
Forward citations
Cited by 1 Pith paper
-
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.