pith. machine review for the scientific record.

arxiv: 2604.10789 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot video-to-3D · compositional 3D reconstruction · vision foundation models · textual-visual-spatial alignment · 3D scene composition · C3DR benchmark · spatial intelligence

The pith

A five-stage cascade extracts and aligns textual, visual, and spatial priors from vision foundation models to convert casually captured videos into coherent 3D scenes automatically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replicate the human ability to segment objects in video and mentally assemble them into structured 3D scenes, a capability it calls compositional 3D reconstruction. Existing approaches require manual prompts, extra visual inputs, or training on limited scenes, which blocks real-world use. ReplicateAnyScene addresses this with a fully automated, zero-shot pipeline that draws generic priors from off-the-shelf vision foundation models and grounds them into 3D representations. The authors also release the C3DR benchmark to measure reconstruction quality across multiple dimensions. If the approach works, it removes the main barriers to deploying spatial intelligence systems that build 3D scenes directly from everyday video.

Core claim

ReplicateAnyScene is a zero-shot framework that transforms casually captured videos into compositional 3D scenes through a five-stage cascade. The cascade extracts generic priors from vision foundation models along textual, visual, and spatial dimensions, then structurally aligns and grounds those priors into physically plausible 3D representations while preserving semantic coherence.

What carries the argument

The five-stage cascade that extracts textual, visual, and spatial priors from foundation models and aligns them into structured 3D representations.
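
To make the cascade's shape concrete, here is a minimal orchestration sketch in Python. Every name, signature, and stage body is a hypothetical stand-in (this review exposes no API from the paper); the stage breakdown follows the five steps enumerated in the simulated rebuttal further down: textual priors, visual segmentation, spatial estimation, structural alignment, and 3D grounding.

    # Hypothetical skeleton of a five-stage textual-visual-spatial cascade.
    # Every function is a toy stand-in, not the authors' implementation; the
    # real stages would call off-the-shelf vision foundation models.
    from dataclasses import dataclass


    @dataclass
    class Scene:
        objects: list   # per-object 3D assets (placeholders here)
        layout: dict    # object name -> pose in a shared world frame


    def stage1_textual_priors(frames):
        """Caption objects and spatial relations with a vision-language model."""
        return {"objects": ["chair", "table"],
                "relations": [("chair", "next_to", "table")]}


    def stage2_visual_priors(frames, textual):
        """Segment and track each captioned object (e.g., a SAM-style model)."""
        return {name: [f"mask_f{i}" for i in range(len(frames))]
                for name in textual["objects"]}


    def stage3_spatial_priors(frames):
        """Estimate monocular depth and camera pose for every frame."""
        return {"depth": [None] * len(frames), "poses": [None] * len(frames)}


    def stage4_structural_alignment(textual, visual, spatial):
        """Reconcile modalities: snap mask boundaries toward depth edges and
        let textual relations constrain relative object placement."""
        return {name: {"pose": (float(i), 0.0, 0.0)}
                for i, name in enumerate(visual)}


    def stage5_grounding(aligned):
        """Lift each aligned object into 3D and assemble the composed scene."""
        return Scene(objects=list(aligned),
                     layout={k: v["pose"] for k, v in aligned.items()})


    def replicate_any_scene(frames) -> Scene:
        textual = stage1_textual_priors(frames)
        visual = stage2_visual_priors(frames, textual)
        spatial = stage3_spatial_priors(frames)
        aligned = stage4_structural_alignment(textual, visual, spatial)
        return stage5_grounding(aligned)

The design point to notice is that no stage is learned: if the paper's claim holds, all task-specific intelligence lives in the fixed alignment logic of stages 4 and 5, with every prior coming from a frozen foundation model.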

If this is right

  • Removes the need for manual object prompting or auxiliary visual inputs in video-to-3D pipelines.
  • Produces scenes that maintain semantic coherence and physical plausibility across diverse casual videos.
  • Extends beyond the simple scenes that training-biased methods can handle.
  • Provides the C3DR benchmark for systematic multi-aspect evaluation of compositional 3D reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the alignment succeeds across modalities, similar cascades could be applied to other video-to-geometry tasks such as dynamic scene editing.
  • The method implicitly suggests that foundation-model priors are already rich enough for 3D grounding, reducing the need for large 3D-specific training sets.
  • Deployment in embodied AI would become feasible once the cascade handles longer videos or changing lighting without drift.

Load-bearing premise

Off-the-shelf vision foundation models already supply accurate and mutually consistent textual, visual, and spatial priors that a fixed five-stage cascade can turn into physically plausible 3D scenes without any task-specific fine-tuning.

What would settle it

A test set of casual videos in which the foundation-model priors disagree on object boundaries or spatial relations, followed by measurement of whether the output 3D scenes show clear physical implausibility or semantic errors.
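
One concrete way to operationalize that test, sketched below under stated assumptions: for each frame, measure how many segmentation-boundary pixels lack a nearby depth discontinuity, then report scene quality separately on the high-disagreement slice. The helper functions and thresholds are hypothetical; nothing here comes from the paper or the C3DR benchmark.

    # Hypothetical disagreement probe: how often do a segmentation prior and
    # a depth prior disagree about object boundaries in a single frame?
    # High-disagreement frames are where a fixed alignment cascade should be
    # most likely to break.
    import numpy as np


    def mask_boundary(mask: np.ndarray) -> np.ndarray:
        """Pixels whose right or lower neighbour crosses the mask edge."""
        b = np.zeros_like(mask, dtype=bool)
        b[:-1, :] |= mask[:-1, :] != mask[1:, :]
        b[:, :-1] |= mask[:, :-1] != mask[:, 1:]
        return b


    def depth_edges(depth: np.ndarray, rel_thresh: float = 0.1) -> np.ndarray:
        """Depth discontinuities: large relative jumps between neighbours."""
        e = np.zeros_like(depth, dtype=bool)
        e[:-1, :] |= np.abs(np.diff(depth, axis=0)) > rel_thresh * depth[:-1, :]
        e[:, :-1] |= np.abs(np.diff(depth, axis=1)) > rel_thresh * depth[:, :-1]
        return e


    def disagreement(mask: np.ndarray, depth: np.ndarray, tol: int = 2) -> float:
        """Fraction of mask-boundary pixels with no depth edge within tol px."""
        mb, de = mask_boundary(mask), depth_edges(depth)
        for _ in range(tol):                 # crude dilation of depth edges
            for axis in (0, 1):
                de |= np.roll(de, 1, axis) | np.roll(de, -1, axis)
        return float((mb & ~de).sum()) / max(int(mb.sum()), 1)

A stress set could then be defined as, say, the top decile of frames by this disagreement score, with physical-plausibility and semantic-coherence metrics reported on that slice alone; if the cascade genuinely reconciles conflicting priors, quality there should degrade only modestly.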

Figures

Figures reproduced from arXiv: 2604.10789 by Chong Xia, Long Xu, Mingyuan Jia, Mingyu Dong, Weichen Lyu, Yueqi Duan, Zheng Zhu.

Figure 1. We propose ReplicateAnyScene, a framework for fully automated and zero-shot compositional 3D reconstruction from casually captured videos. Our method extracts and aligns cross-modal priors from vision foundation models to generate semantically coherent and physically plausible 3D scenes.
Figure 2. Framework Overview. Our pipeline consists of a five-stage cascade where each stage is specifically designed to resolve targeted alignment gaps among our three core modalities: textual (green), visual (orange), and spatial (blue). The gradient backgrounds and multi-colored dashed borders within each module explicitly illustrate the specific cross-modal alignment process occurring at that step.
Figure 3. Overview of the proposed C3DR benchmark.
Figure 4. Qualitative comparison on the C3DR benchmark.
Figure 5. Qualitative ablation study on stages 2 to 5 of our pipeline.
original abstract

Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ReplicateAnyScene, a zero-shot framework for transforming casually captured videos into compositional 3D scenes. It employs a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations while claiming to ensure semantic coherence and physical plausibility without manual prompting, auxiliary inputs, or task-specific fine-tuning. The work also introduces the C3DR benchmark for multi-aspect evaluation and reports superiority over baselines in experiments.

Significance. If the cascade reliably reconciles potentially inconsistent priors from off-the-shelf models (e.g., CLIP-style text, SAM-style segmentation, monocular depth/pose) into physically plausible 3D scenes, the result would advance automated compositional reconstruction for embodied AI and spatial intelligence. The introduction of the C3DR benchmark is a clear positive contribution that enables more rigorous future comparisons. The zero-shot, fully automated design addresses practical deployment barriers noted in prior work.

major comments (2)
  1. [Abstract] The central claim that the five-stage cascade 'extract[s] and structurally align[s] generic priors ... grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility' is load-bearing, yet the abstract provides no description of the stages, the alignment procedure, or the mechanism for detecting and reconciling conflicts (e.g., segmentation boundaries disagreeing with depth edges, or captions omitting spatial relations). Without such details or ablations, it is impossible to verify that a fixed, non-learned cascade suffices for complex casual videos.
  2. [Abstract; implied §4 Experiments] The assertions of 'superiority ... in generating high-quality compositional 3D scenes' and of 'extensive experiments' are not supported by any referenced quantitative metrics, ablation studies, error analysis, or tables in the provided description. This undermines assessment of whether the pipeline actually achieves the claimed physical plausibility, especially given the weakest assumption: that foundation-model priors are already mutually consistent.
minor comments (1)
  1. [Abstract] The abstract uses 'ReplicateAnyScene' and 'C3DR benchmark' without initial definition or expansion; clarify on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address the major comments point by point below, agreeing that the abstract can be strengthened for clarity while noting that the full manuscript already contains the requested technical details and experimental support.

point-by-point responses
  1. Referee: [Abstract] The central claim that the five-stage cascade 'extract[s] and structurally align[s] generic priors ... grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility' is load-bearing, yet the abstract provides no description of the stages, the alignment procedure, or the mechanism for detecting and reconciling conflicts (e.g., segmentation boundaries disagreeing with depth edges, or captions omitting spatial relations). Without such details or ablations, it is impossible to verify that a fixed, non-learned cascade suffices for complex casual videos.

    Authors: We agree that the abstract's brevity limits immediate visibility into the pipeline. The full manuscript (Section 3) details the five-stage cascade: (1) textual prior extraction via vision-language models, (2) visual segmentation and feature alignment, (3) spatial prior estimation from monocular depth and pose, (4) cross-dimensional structural alignment using geometric and semantic constraints, and (5) 3D grounding with conflict resolution (e.g., depth edges override inconsistent segmentation boundaries, and spatial relations from captions guide object layout; a toy sketch of this priority rule follows these responses). Section 4.3 provides ablations confirming each stage's role in reconciling inconsistencies without learned components. We will revise the abstract to include a concise outline of the stages and reconciliation approach. revision: yes

  2. Referee: [Abstract; implied §4 Experiments] The assertions of 'superiority ... in generating high-quality compositional 3D scenes' and of 'extensive experiments' are not supported by any referenced quantitative metrics, ablation studies, error analysis, or tables in the provided description. This undermines assessment of whether the pipeline actually achieves the claimed physical plausibility, especially given the weakest assumption: that foundation-model priors are already mutually consistent.

    Authors: The abstract summarizes results at a high level, but we concur that explicit metrics would better substantiate the claims. The full manuscript (Section 4) reports quantitative results on the C3DR benchmark, including metrics for semantic coherence, geometric accuracy, and physical plausibility, with ablations, error analysis, and comparisons to baselines demonstrating superiority and the effectiveness of the alignment process in handling inconsistent priors. We will revise the abstract to reference key quantitative outcomes and note that the cascade explicitly mitigates prior inconsistencies rather than assuming consistency. revision: yes
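
The first response states a concrete priority rule: where visual and spatial priors conflict, the depth edge wins. Below is a toy Python illustration of that rule; the median heuristic and the tolerance are our assumptions, not the authors' implementation.

    # Toy version of the stated priority rule: when a segmentation mask and
    # a depth map disagree, trust depth. A mask that "leaks" onto background
    # at a different depth is trimmed back to the depth-supported region.
    # Illustrative only; not the paper's code.
    import numpy as np


    def depth_consistent_mask(mask: np.ndarray, depth: np.ndarray,
                              rel_tol: float = 0.15) -> np.ndarray:
        """Keep mask pixels whose depth lies within rel_tol of the median
        object depth, dropping depth-inconsistent spill-over."""
        obj_depth = float(np.median(depth[mask]))
        consistent = np.abs(depth - obj_depth) <= rel_tol * obj_depth
        return mask & consistent


    if __name__ == "__main__":
        depth = np.full((4, 6), 5.0)
        depth[:, 4:] = 9.0                 # background plane, much farther
        mask = np.zeros((4, 6), dtype=bool)
        mask[:, 1:] = True                 # mask leaks onto the background
        print(depth_consistent_mask(mask, depth).astype(int))
        # columns 1-3 survive (depth 5.0); leaked columns 4-5 are trimmed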

Circularity Check

0 steps flagged

No circularity: engineering pipeline without self-referential derivations

full rationale

The paper describes a five-stage cascade that extracts and aligns priors from existing vision foundation models (textual, visual, spatial) to produce compositional 3D scenes from video. No equations, fitted parameters, or uniqueness theorems are presented that reduce outputs to inputs by construction. Claims of semantic coherence and physical plausibility rest on the empirical behavior of off-the-shelf models plus a fixed cascade, evaluated on a newly introduced benchmark; these are not tautological or self-citation-dependent. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that generic priors from existing vision foundation models are already rich enough and mutually consistent for direct grounding into 3D scenes. No new entities are postulated and no explicit free parameters are named in the abstract.

axioms (1)
  • domain assumption: Vision foundation models supply reliable generic priors across textual, visual, and spatial dimensions that can be structurally aligned without task-specific training.
    The five-stage cascade is defined as extracting and aligning these priors; the entire pipeline depends on this assumption holding for arbitrary casual videos.

pith-pipeline@v0.9.0 · 5521 in / 1350 out tokens · 67761 ms · 2026-05-10T15:33:47.057993+00:00 · methodology

