
arxiv: 2605.23888 · v1 · pith:QFRL7TSXnew · submitted 2026-05-22 · 💻 cs.CV

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Pith reviewed 2026-05-25 04:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene reconstruction · multi-view images · generative priors · conditional generation · PBR meshes · indoor environments · projection conditioning · scene-scale generation

The pith

A projection-based conditioning mechanism lifts object-level generative priors to multi-view scene-scale 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames multi-view scene reconstruction as conditional 3D generation over spatially localized, overlapping chunks that tile the full scene. It introduces a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model. This produces high-fidelity, multi-view consistent geometry without dependence on view ordering. The result is faithful, editable PBR mesh reconstructions of indoor environments that are reported to outperform existing methods by 16%.

Core claim

The paper claims that casting reconstruction as conditional generation over tiled chunks, combined with a projection-based conditioning mechanism, lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene. This enables scaling from object-level priors to scene-scale generation, yielding high-fidelity editable PBR mesh reconstructions of indoor environments that outperform cutting-edge reconstruction methods by 16%.

What carries the argument

The projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model.
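
To make the idea of lifting posed image features into a 3D grid concrete, the following is a minimal sketch, not the paper's implementation. It assumes a pinhole camera model, nearest-neighbour feature lookup, and a plain mean over the views that see each voxel; the page does not state which aggregation the authors use, so the mean is an assumption chosen because it is order-invariant. The names `View`, `project`, `sample`, and `lift` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


class View:
    """A posed pinhole view with a small 2D feature map (hypothetical container)."""

    def __init__(self, R, t, fx, fy, cx, cy, feats):
        self.R = R          # 3x3 world-to-camera rotation, row-major nested lists
        self.t = t          # world-to-camera translation
        self.fx, self.fy, self.cx, self.cy = fx, fy, cx, cy
        self.feats = feats  # feats[row][col] is a list of floats


def project(view: View, p: Vec3) -> Optional[Tuple[float, float]]:
    """Return the (col, row) pixel of world point p, or None if it is behind the camera."""
    x, y, z = (
        sum(view.R[i][j] * p[j] for j in range(3)) + view.t[i] for i in range(3)
    )
    if z <= 1e-6:
        return None
    return view.fx * x / z + view.cx, view.fy * y / z + view.cy


def sample(view: View, p: Vec3) -> Optional[List[float]]:
    """Nearest-neighbour feature lookup for p, or None if p is not visible."""
    uv = project(view, p)
    if uv is None:
        return None
    col, row = int(round(uv[0])), int(round(uv[1]))
    if 0 <= row < len(view.feats) and 0 <= col < len(view.feats[0]):
        return view.feats[row][col]
    return None


def lift(views: Sequence[View], voxel_centers: Sequence[Vec3], dim: int) -> List[List[float]]:
    """Average the visible per-view features at each voxel center.

    The mean is an assumed aggregation; it does not depend on view order
    up to floating-point rounding.
    """
    grid = []
    for p in voxel_centers:
        hits = [f for f in (sample(v, p) for v in views) if f is not None]
        if hits:
            grid.append([sum(f[k] for f in hits) / len(hits) for k in range(dim)])
        else:
            grid.append([0.0] * dim)  # voxel not seen by any view
    return grid
```

Because every voxel is addressed in world coordinates, the resulting grid is anchored to the scene rather than to any single camera, which is the property the summary attributes to the conditioning mechanism.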

If this is right

  • Scene reconstruction is performed as conditional 3D generation over overlapping chunks that tile large extents.
  • Generated geometry remains multi-view consistent and high-fidelity.
  • The output consists of faithful, editable PBR meshes suitable for indoor environments.
  • Quantitative performance exceeds cutting-edge reconstruction methods by 16%.
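
The chunk tiling in the first bullet can be sketched as follows. This is an illustrative layout only, assuming axis-aligned cubic chunks with a fixed edge length and a fixed overlap; the page gives neither the chunk size nor the overlap the authors use, and `axis_starts` and `tile_chunks` are hypothetical names.

```python
from itertools import product
from typing import List, Sequence, Tuple


def axis_starts(lo: float, hi: float, size: float, overlap: float) -> List[float]:
    """Chunk start positions along one axis so that the interval [lo, hi] is covered."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    starts = [lo]
    # Keep adding chunks until the last one reaches the upper bound.
    while starts[-1] + size < hi:
        starts.append(starts[-1] + stride)
    return starts


def tile_chunks(
    bbox_min: Sequence[float],
    bbox_max: Sequence[float],
    size: float,
    overlap: float,
) -> List[Tuple[float, ...]]:
    """Return the minimum corner of every cubic chunk covering the bounding box."""
    per_axis = [
        axis_starts(lo, hi, size, overlap) for lo, hi in zip(bbox_min, bbox_max)
    ]
    return list(product(*per_axis))
```

For example, `tile_chunks((0.0, 0.0, 0.0), (5.0, 2.0, 2.0), 2.0, 0.5)` yields three chunks along the x axis, each sharing a 0.5-unit band with its neighbour.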

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The conditioning approach is presented as general and could therefore apply to other generative shape models.
  • Chunk-based tiling implies the method can extend to larger environments simply by increasing the number of chunks.
  • Editable PBR meshes enable direct use in graphics pipelines for editing or simulation without additional conversion steps.

Load-bearing premise

The projection-based conditioning mechanism lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene.

What would settle it

The central claim would be falsified by a quantitative evaluation on standard indoor multi-view benchmarks showing that the generated meshes are inconsistent across views or do not exceed baseline reconstruction accuracy by the reported 16% margin.

Figures

Figures reproduced from arXiv: 2605.23888 by Angela Dai, Jozef Hladký, Katharina Schmid, Matthias Nießner, Nicolas von Lützow.

Figure 1: GenRecon. Given a sparse set of RGB images of an indoor scene (left), our method reconstructs a complete, high-fidelity PBR mesh (center) by formulating scene reconstruction as conditional 3D generation over overlapping spatial chunks. The recovered mesh with material properties enables realistic relighting and editing in standard rendering pipelines (right).
Figure 2: Pipeline overview. Given posed RGB images and a sparse point cloud from SfM (left), we define overlapping scene chunks and construct a global 3D conditioning grid by lifting DINOv3 image features into per-view volumes and aggregating them (center). By extending a 3D generative prior with a new spatially-grounded multi-view conditioning pathway, we jointly generate all chunks in a single flow-matching traje…
Figure 3 qualitatively compares the performance of our method against the baselines on ScanNet++ using 8 input images. While the baselines produce noisy (2DGS, DA3), oversmooth (FineRecon, MonoSDF) surfaces for challenging areas and are incomplete in occluded and unobserved areas (2DGS, DA3, FineRecon, Murre), our approach yields complete and high-fidelity reconstructions.
  Panels: 2DGS, DA3, FineRecon, MonoSDF, Murre, Ours, Grou…
Figure 4: PBR results. Qualitative results on four reconstructed scenes from ScanNet++: lit scene (left), albedo (middle), metallic and roughness (right).
Figure 5: Relighting results. Varying lighting configurations for scenes reconstructed from ScanNet++. Further visualizations are provided in Appendix C.
  Adjacent text (Section 4.4, Limitations): While our method achieves strong reconstruction quality across a wide range of indoor scenes, several limitations remain. Reconstructions of non-Lambertian surfaces, such as glass and mirrors, are less reliable, as such materials are underrepresent…
Figure 6: Large Scene Generation. Top-down view (left) and multiple close-ups (right).
Figure 7: Ablation results. Ablation study on SAGE-10k chunks not seen during training. Our projection-based 3D conditioning effectively enables pose-correct chunk generation from as few as a single input image. Reconstruction quality improves with additional views.
Figure 8: Additional relighting results. Varying lighting configurations for scenes reconstructed from ScanNet++.
  Adjacent text (Appendix D, Baseline Implementation Details): We evaluate all baselines in the same posed multi-view setting as our method. Whenever possible, we use the exact released configurations and only make changes required to adapt the methods to our evaluation protocol. All such changes fix issues and make the methods appl…
Original abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GenRecon, a method for high-fidelity 3D scene reconstruction from multi-view RGB images. It formulates scene reconstruction as conditional 3D generation over spatially-localized overlapping chunks that tile the scene, generalizing the object-level generative prior of Trellis.2 to scene scale. A projection-based conditioning mechanism is proposed to lift posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding multi-view consistent PBR mesh outputs that outperform cutting-edge reconstruction methods by 16%.

Significance. If the central claims hold, the work would meaningfully advance integration of strong generative 3D priors with multi-view reconstruction, enabling scalable, high-fidelity, and editable reconstructions of indoor scenes. The chunk-tiling strategy and conditioning approach could address limitations in applying object-centric generative models to large environments while preserving consistency.

major comments (2)
  1. [Abstract] The claim that the method 'outperform[s] cutting-edge reconstruction methods by 16%' is stated without reference to any metrics, baselines, datasets, error analysis, or experimental protocol, rendering the central quantitative result unverifiable from the provided description.
  2. [Abstract] The projection-based conditioning is asserted to lift features into a 'coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene', but the abstract supplies no implementation details on the underlying 3D structure, feature projection, cross-view aggregation, pose encoding, or invariance operation. This mechanism is load-bearing for the claim that Trellis.2's object-level prior can be lifted to overlapping scene chunks while preserving multi-view consistency.
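
The first comment turns on what "by 16%" is measured against. As a small worked illustration, the helper below computes a relative improvement for either metric direction. The function name `relative_improvement` is hypothetical, and the numbers in the usage note are placeholders chosen to give 16%; they are not results from the paper.

```python
def relative_improvement(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Relative gain over a baseline, as a fraction of the baseline value.

    For an error metric (lower is better) the gain is the error reduction;
    for a score metric (higher is better) it is the score increase.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return delta / baseline
```

With placeholder values, an error falling from 100 to 84 and a score rising from 50 to 58 both give 0.16, which is why the metric, its direction, and the baseline must be named before the figure can be checked.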

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the abstract to improve clarity and verifiability of the claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that the method 'outperform[s] cutting-edge reconstruction methods by 16%' is stated without reference to any metrics, baselines, datasets, error analysis, or experimental protocol, rendering the central quantitative result unverifiable from the provided description.

    Authors: We agree that the abstract's quantitative claim would benefit from additional context. In the revised manuscript we will update the abstract to name the metric, the main baselines, the dataset, and a pointer to the experimental section (Section 4) for the protocol and analysis, while keeping the abstract concise.
    Revision: yes

  2. Referee: [Abstract] The projection-based conditioning is asserted to lift features into a 'coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene', but supplies no implementation details on the underlying 3D structure, feature projection, cross-view aggregation, pose encoding, or invariance operation. This mechanism is load-bearing for the claim that Trellis.2's object-level prior can be lifted to overlapping scene chunks while preserving multi-view consistency.

    Authors: Abstracts are necessarily high-level. The full technical description of the projection-based conditioning (3D chunk structure, feature projection, cross-view aggregation, pose encoding, and order invariance) appears in Section 3.2. We will add one short clarifying phrase to the abstract to better signal these aspects without exceeding length limits.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: derivation builds on external Trellis.2 prior via proposed conditioning mechanism

full rationale

The paper's central derivation introduces a projection-based conditioning to lift multi-view features into a 3D representation compatible with an external generative model (Trellis.2). This is presented as a novel proposal rather than a self-referential fit, renaming, or self-citation chain. No equations or steps reduce the claimed outputs to the inputs by construction; the 16% improvement is positioned as an empirical outcome of applying the external prior at scene scale. The mechanism is described as independent of view ordering and spatially anchored, but this is asserted as a design property of the proposed method, not derived tautologically from fitted quantities or prior self-work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5735 in / 1191 out tokens · 63263 ms · 2026-05-25T04:35:13.175497+00:00 · methodology



Reference graph

Works this paper leans on

75 extracted references · 63 canonical work pages · 17 internal anchors

  1. [1]

    Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin

    Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction, 2023. URL https://arxiv.org/abs/2306.03092

  2. [2]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,

  3. [3]

    URLhttps://arxiv.org/abs/2106.10689

  4. [4]

    V olume rendering of neural implicit surfaces, 2021

    Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. V olume rendering of neural implicit surfaces, 2021. URLhttps://arxiv.org/abs/2106.12052

  5. [5]

    Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022

    Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022. URL https://arxiv.org/abs/2206.00665

  6. [6]

    Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025

    Hanlin Chen, Chen Li, Yunsong Wang, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025. URL https://arxiv.org/abs/ 2312.00846

  7. [7]

    3d gaussian splatting for real-time radiance field rendering, 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering, 2023. URL https://arxiv.org/abs/2308. 04079

  8. [8]

    2d gaussian splatting for geometrically accurate radiance fields, 2025

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields, 2025. URL https://arxiv.org/abs/ 2403.17888

  9. [9]

    Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025

    Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025. URLhttps://arxiv.org/abs/2406.06521

  10. [10]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URLhttps://arxiv.org/abs/2312.14132

  11. [11]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URLhttps://arxiv.org/abs/2406.09756

  12. [12]

    Vggt: Visual geometry grounded transformer, 2025

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025. URL https://arxiv.org/ abs/2503.11651. 10

  13. [13]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views, 2025. URL https://arxiv.org/abs/2511.10647

  14. [14]

    Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021

    Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021. URL https://arxiv.org/ abs/2104.00681

  15. [15]

    Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023

    Noah Stier, Anurag Ranjan, Alex Colburn, Yajie Yan, Liang Yang, Fangchang Ma, and Baptiste Angles. Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023. URLhttps://arxiv.org/abs/2304.01480

  16. [16]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024. URL https: //arxiv.org/abs/2312.12337

  17. [17]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024. URLhttps://arxiv.org/abs/2403.14627

  18. [18]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025. URLhttps://arxiv.org/abs/2505.23716

  19. [19]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models, 2024. URLhttps://arxiv.org/abs/2404.07191

  20. [20]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023. URLhttps://arxiv.org/abs/2306.16928

  21. [21]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image, 2024. URLhttps://arxiv.org/abs/2309.03453

  22. [22]

    Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

    Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, So...

  23. [23]

    Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman

    Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman. Complete gaussian splats from a single image with denoising diffusion models, 2025. URLhttps://arxiv.org/abs/2508.21542

  24. [24]

    Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025

    Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025. URLhttps://arxiv.org/abs/2510.23306

  25. [25]

    Native and Compact Structured Latents for 3D Generation

    Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation, 2025. URLhttps://arxiv.org/abs/2512.14692

  26. [26]

    Schonberger and Jan-Michael Frahm

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV , USA, June 2016. IEEE. doi: 10.1109/cvpr.2016.445. 11

  27. [27]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020. URL https://arxiv.org/abs/2003.08934

  28. [28]

    Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026

    Rafał Tobiasz, Grzegorz Wilczy´nski, Marcin Mazur, Sławomir Tadeja, Weronika Smolak- Dy˙zewska, and Przemysław Spurek. Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026. URLhttps://arxiv.org/abs/2502.07754

  29. [29]

    Efros, and Angjoo Kanazawa

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state, 2025. URL https://arxiv.org/abs/ 2501.12387

  30. [30]

    Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021

    Aljaž Božiˇc, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021. URL https://arxiv.org/ abs/2107.02191

  31. [31]

    V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021

    Noah Stier, Alexander Rich, Pradeep Sen, and Tobias Höllerer. V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021. URL https://arxiv.org/abs/2112.00236

  32. [32]

    Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023

    Huiyu Gao, Wei Mao, and Miaomiao Liu. Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023. URLhttps://arxiv.org/abs/2304.10687

  33. [33]

    Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024

    Youngju Na, Woo Jae Kim, Kyu Beom Han, Suhyeon Ha, and Sung eui Yoon. Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024. URLhttps://arxiv.org/abs/2403.05086

  34. [34]

    Simplerecon: 3d reconstruction without 3d convolutions, 2022

    Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions, 2022. URL https://arxiv. org/abs/2208.14743

  35. [35]

    Depthsplat: Connecting gaussian splatting and depth, 2025

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth, 2025. URL https: //arxiv.org/abs/2410.13862

  36. [36]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

  37. [37]

    URLhttps://arxiv.org/abs/2410.24207

  38. [38]

    Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025. URL https: //arxiv.org/abs/2410.22128

  39. [39]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025

    Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025. URL https://arxiv.org/abs/ 2511.07321

  40. [40]

    Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025

    Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025. URL https://arxiv. org/abs/2503.22986

  41. [41]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models, 2024. URLhttps://arxiv.org/abs/2405.10314

  42. [42]

    Srinivasan, Dor Verbin, Jonathan T

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023. URL https://arxiv.org/abs/2312.02981

  43. [43]

    Multi-view reconstruction via sfm-guided monocular depth estimation,

    Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, and Hujun Bao. Multi-view reconstruction via sfm-guided monocular depth estimation,

  44. [44]

    URLhttps://arxiv.org/abs/2503.14483. 12

  45. [45]

    Depthcrafter: Generating consistent long depth sequences for open-world videos,

    Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos,

  46. [46]

    URLhttps://arxiv.org/abs/2409.02095

  47. [47]

    Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025

    Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, and Ying Shan. Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025. URLhttps://arxiv.org/abs/2504.01016

  48. [48]

    Reconx: Reconstruct any scene from sparse views with video diffusion model,

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model,

  49. [49]

    URLhttps://arxiv.org/abs/2408.16767

  50. [50]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis, 2024. URLhttps://arxiv.org/abs/2409.02048

  51. [51]

    MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, and Hongbin Zha. Mv-sam3d: Adaptive multi-view fusion for layout-aware 3d generation, 2026. URL https: //arxiv.org/abs/2603.11633

  52. [52]

    Pixal3D: Pixel-Aligned 3D Generation from Images

    Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, and Shi-Min Hu. Pixal3d: Pixel-aligned 3d generation from images, 2026. URL https://arxiv.org/abs/2605.10922

  53. [53]

    Scenewiz3d: Towards text-guided 3d scene composition, 2023

    Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition, 2023. URLhttps://arxiv.org/abs/2312.08885

  54. [54]

    Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024

    Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024. URLhttps://arxiv.org/abs/2402.07207

  55. [55]

    Discene: Object decoupling and interaction modeling for complex scene generation

    Xiao-Lei Li, Haodong Li, Hao-Xiang Chen, Tai-Jiang Mu, and Shi-Min Hu. Discene: Object decoupling and interaction modeling for complex scene generation. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400711312. doi: 10.1145/3680528.3687589. URL https://doi.org/10.1145/ 3680528.3687589

  56. [56]

    Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024

    Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024. URL https: //arxiv.org/abs/2403.12409

  57. [57]

    Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025

    Haonan Han, Rui Yang, Huan Liao, Jiankai Xing, Zunnan Xu, Xiaoming Yu, Junwei Zha, Xiu Li, and Wanhua Li. Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025. URLhttps://arxiv.org/abs/2405.18525

  58. [58]

    Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025

    Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025. URLhttps://arxiv.org/abs/2502.12894

  59. [59]

    Dreamanywhere: Object-centric panoramic 3d scene generation, 2025

    Edoardo Alberto Dominici, Jozef Hladky, Floor Verhoeven, Lukas Radl, Thomas Deixelberger, Stefan Ainetter, Philipp Drescher, Stefan Hauswiesner, Arno Coomans, Giacomo Nazzaro, Konstantinos Vardis, and Markus Steinberger. Dreamanywhere: Object-centric panoramic 3d scene generation, 2025. URLhttps://arxiv.org/abs/2506.20367

  60. [60]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

  61. [61]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  62. [62]

    Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

  63. [63]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  64. [64]

    Multidiffusion: Fusing diffusion paths for controlled image generation, 2023

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URL https://arxiv.org/abs/2302.08113

  65. [65]

    Sage: Scalable agentic 3d scene generation for embodied ai, 2026

    Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, and Fangyin Wei. Sage: Scalable agentic 3d scene generation for embodied ai, 2026. URL https://arxiv.org/abs/2602.10116

  66. [66]

    Structured 3D Latents for Scalable and Versatile 3D Generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024

  67. [67]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3d-front: 3d furnished rooms with layouts and semantics. arXiv preprint arXiv:2011.09127, 2020

  68. [68]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the International Conference on Computer Vision (ICCV), 2023

  69. [69]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  70. [70]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598

  71. [71]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  72. [72]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6...

  73. [73]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  74. [74]

    Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans

    Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021

  75. [75]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

A Experimental Setup

Implementation Details. We build on Trellis.2 [24] at resolution 512. Consequently, for the occupancy g...