GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Angela Dai; Jozef Hladk\'y; Katharina Schmid; Matthias Nie{\ss}ner; Nicolas von L\"utzow

arxiv: 2605.23888 · v1 · pith:QFRL7TSXnew · submitted 2026-05-22 · 💻 cs.CV

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Katharina Schmid , Nicolas von L\"utzow , Jozef Hladk\'y , Angela Dai , Matthias Nie{\ss}ner This is my paper

Pith reviewed 2026-05-25 04:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene reconstructionmulti-view imagesgenerative priorsconditional generationPBR meshesindoor environmentsprojection conditioningscene-scale generation

0 comments

The pith

A projection-based conditioning mechanism lifts object-level generative priors to multi-view scene-scale 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames multi-view scene reconstruction as conditional 3D generation over spatially-localized overlapping chunks that tile the full scene. It introduces a projection-based conditioning mechanism to lift posed multi-view image features into a coherent 3D representation that aligns with the generative model. This produces high-fidelity, multi-view consistent geometry without dependence on view ordering. The result is faithful editable PBR mesh reconstructions of indoor environments that outperform existing methods by 16%.

Core claim

The paper claims that casting reconstruction as conditional generation over tiled chunks, combined with a projection-based conditioning mechanism, lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene. This enables scaling from object-level priors to scene-scale generation, yielding high-fidelity editable PBR mesh reconstructions of indoor environments that outperform cutting-edge reconstruction methods by 16%.

What carries the argument

The projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model.

If this is right

Scene reconstruction is performed as conditional 3D generation over overlapping chunks that tile large extents.
Generated geometry remains multi-view consistent and high-fidelity.
The output consists of faithful, editable PBR meshes suitable for indoor environments.
Quantitative performance exceeds cutting-edge reconstruction methods by 16%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The conditioning approach is presented as general and could therefore apply to other generative shape models.
Chunk-based tiling implies the method can extend to larger environments simply by increasing the number of chunks.
Editable PBR meshes enable direct use in graphics pipelines for editing or simulation without additional conversion steps.

Load-bearing premise

The projection-based conditioning mechanism lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene.

What would settle it

A quantitative evaluation on standard indoor multi-view benchmarks that shows the generated meshes lack consistency across views or fail to exceed baseline reconstruction accuracy by the reported margin would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23888 by Angela Dai, Jozef Hladk\'y, Katharina Schmid, Matthias Nie{\ss}ner, Nicolas von L\"utzow.

**Figure 1.** Figure 1: GenRecon. Given a sparse set of RGB images of an indoor scene (left), our method reconstructs a complete, high-fidelity PBR mesh (center) by formulating scene reconstruction as conditional 3D generation over overlapping spatial chunks. The recovered mesh with material properties enables realistic relighting and editing in standard rendering pipelines (right). Abstract We introduce a new approach to high-fi… view at source ↗

**Figure 2.** Figure 2: Pipeline overview. Given posed RGB images and a sparse point cloud from SfM (left), we define overlapping scene chunks and construct a global 3D conditioning grid by lifting DINOv3 image features into per-view volumes and aggregating them (center). By extending a 3D generative prior with a new spatially-grounded multi-view conditioning pathway, we jointly generate all chunks in a single flow-matching traje… view at source ↗

**Figure 3.** Figure 3: qualitatively compares the performance of our method against the baselines on ScanNet++ using 8 input images. While the baselines produce noisy (2DGS, DA3), oversmooth (FineRecon, MonoSDF) surfaces for challenging areas and are incomplete in occluded and unobserved areas (2DGS, DA3, FineRecon, Murre), our approach yields complete and high-fidelity reconstructions. 2DGS DA3 FineRecon MonoSDF Murre Ours Grou… view at source ↗

**Figure 4.** Figure 4: PBR results. Qualitative results on four reconstructed scenes from ScanNet++: lit scene (left), albedo (middle), metallic and roughness (right) [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Relighting results. Varying lighting configurations for scenes reconstructed from ScanNet++. Further visualizations are provided in Appendix C. 4.4 Limitations While our method achieves strong reconstruction quality across a wide range of indoor scenes, several limitations remain. Reconstructions of non-Lambertian surfaces, such as glass and mirrors, are less reliable, as such materials are underrepresent… view at source ↗

**Figure 6.** Figure 6: Large Scene Generation. Top-down view (left) and multiple close-ups (right). Ablation [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation results. Ablation study on SAGE-10k chunks not seen during training. Our projection-based 3D conditioning effectively enables pose-correct chunk generation from as few as a single input image. Reconstruction quality improves with additional views. Relighting Additional relighting results are provided in [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Additional relighting results. Varying lighting configurations for scenes reconstructed from ScanNet++. D Baseline Implementation Details We evaluate all baselines in the same posed multi-view setting as our method. Whenever possible, we use the exact released configurations and only make changes required to adapt the methods to our evaluation protocol. All such changes fix issues and make the methods appl… view at source ↗

read the original abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The projection conditioning is the real addition here, but the 16% gain claim sits on thin evidence until the numbers and baselines are shown.

read the letter

The paper's core move is taking an object-level generative model like Trellis.2 and extending it to full indoor scenes by breaking the scene into overlapping chunks and using a projection-based conditioning step to feed multi-view features into the generator. That conditioning is presented as view-order independent and spatially anchored, which would be the piece that actually lets the object prior transfer without breaking consistency across chunks. If the implementation delivers on that, it is a practical way to scale strong generative priors without starting from scratch on scene-level training data. The write-up also emphasizes producing editable PBR meshes, which aligns with downstream uses in content creation. That part reads as a direct engineering response to the limits of pure reconstruction pipelines on incomplete or noisy views. The 16% outperformance figure is stated plainly, but the abstract gives no dataset names, no baseline list, and no error breakdown, so it is impossible to judge whether the gain is real or an artifact of the test setup. The stress-test concern about whether the projection step truly produces order-independent, anchored features is the key one to check in the method section; if the paper only shows standard feature lifting without an explicit symmetric operation or global coordinate tie-in, the independence claim would need more support. The work sits in the middle ground between generative modeling and multi-view reconstruction, so readers already following Trellis-style papers or indoor scene reconstruction benchmarks would get the most out of it. It is worth sending to referees because the idea is concrete enough to evaluate and the problem it targets is active, even if the current evidence level is preliminary.

Referee Report

2 major / 0 minor

Summary. The paper introduces GenRecon, a method for high-fidelity 3D scene reconstruction from multi-view RGB images. It formulates scene reconstruction as conditional 3D generation over spatially-localized overlapping chunks that tile the scene, generalizing the object-level generative prior of Trellis.2 to scene scale. A projection-based conditioning mechanism is proposed to lift posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding multi-view consistent PBR mesh outputs that outperform cutting-edge reconstruction methods by 16%.

Significance. If the central claims hold, the work would meaningfully advance integration of strong generative 3D priors with multi-view reconstruction, enabling scalable, high-fidelity, and editable reconstructions of indoor scenes. The chunk-tiling strategy and conditioning approach could address limitations in applying object-centric generative models to large environments while preserving consistency.

major comments (2)

[Abstract] Abstract: The claim that the method 'outperform[s] cutting-edge reconstruction methods by 16%' is stated without reference to any metrics, baselines, datasets, error analysis, or experimental protocol, rendering the central quantitative result unverifiable from the provided description.
[Abstract] Abstract: The projection-based conditioning is asserted to lift features into a 'coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene', but supplies no implementation details on the underlying 3D structure, feature projection, cross-view aggregation, pose encoding, or invariance operation. This mechanism is load-bearing for the claim that Trellis.2's object-level prior can be lifted to overlapping scene chunks while preserving multi-view consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the abstract to improve clarity and verifiability of the claims.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the method 'outperform[s] cutting-edge reconstruction methods by 16%' is stated without reference to any metrics, baselines, datasets, error analysis, or experimental protocol, rendering the central quantitative result unverifiable from the provided description.

Authors: We agree that the abstract's quantitative claim would benefit from additional context. In the revised manuscript we will update the abstract to name the metric, the main baselines, the dataset, and a pointer to the experimental section (Section 4) for the protocol and analysis, while keeping the abstract concise. revision: yes
Referee: [Abstract] Abstract: The projection-based conditioning is asserted to lift features into a 'coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene', but supplies no implementation details on the underlying 3D structure, feature projection, cross-view aggregation, pose encoding, or invariance operation. This mechanism is load-bearing for the claim that Trellis.2's object-level prior can be lifted to overlapping scene chunks while preserving multi-view consistency.

Authors: Abstracts are necessarily high-level. The full technical description of the projection-based conditioning (3D chunk structure, feature projection, cross-view aggregation, pose encoding, and order invariance) appears in Section 3.2. We will add one short clarifying phrase to the abstract to better signal these aspects without exceeding length limits. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation builds on external Trellis.2 prior via proposed conditioning mechanism

full rationale

The paper's central derivation introduces a projection-based conditioning to lift multi-view features into a 3D representation compatible with an external generative model (Trellis.2). This is presented as a novel proposal rather than a self-referential fit, renaming, or self-citation chain. No equations or steps reduce the claimed outputs to the inputs by construction; the 16% improvement is positioned as an empirical outcome of applying the external prior at scene scale. The mechanism is described as independent of view ordering and spatially anchored, but this is asserted as a design property of the proposed method, not derived tautologically from fitted quantities or prior self-work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5735 in / 1191 out tokens · 63263 ms · 2026-05-25T04:35:13.175497+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

joint chunk generation... MultiDiffusion-style scheme... z_{t-1}(x)=1/Σ Mc(x) Σ Mc(x) ẑ^(c)_{t-1}(x)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 63 canonical work pages · 17 internal anchors

[1]

Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction, 2023. URL https://arxiv.org/abs/2306.03092

work page arXiv 2023
[2]

Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,
[3]

URLhttps://arxiv.org/abs/2106.10689

work page internal anchor Pith review Pith/arXiv arXiv
[4]

V olume rendering of neural implicit surfaces, 2021

Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. V olume rendering of neural implicit surfaces, 2021. URLhttps://arxiv.org/abs/2106.12052

work page arXiv 2021
[5]

Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022

Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022. URL https://arxiv.org/abs/2206.00665

work page arXiv 2022
[6]

Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025

Hanlin Chen, Chen Li, Yunsong Wang, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025. URL https://arxiv.org/abs/ 2312.00846

work page arXiv 2025
[7]

3d gaussian splatting for real-time radiance field rendering, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering, 2023. URL https://arxiv.org/abs/2308. 04079

2023
[8]

2d gaussian splatting for geometrically accurate radiance fields, 2025

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields, 2025. URL https://arxiv.org/abs/ 2403.17888

work page arXiv 2025
[9]

Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025

Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025. URLhttps://arxiv.org/abs/2406.06521

work page arXiv 2025
[10]

Dust3r: Geometric 3d vision made easy, 2024

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URLhttps://arxiv.org/abs/2312.14132

work page arXiv 2024
[11]

Grounding image matching in 3d with mast3r, 2024

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URLhttps://arxiv.org/abs/2406.09756

work page arXiv 2024
[12]

Vggt: Visual geometry grounded transformer, 2025

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025. URL https://arxiv.org/ abs/2503.11651. 10

work page arXiv 2025
[13]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views, 2025. URL https://arxiv.org/abs/2511.10647

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021

Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021. URL https://arxiv.org/ abs/2104.00681

work page arXiv 2021
[15]

Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023

Noah Stier, Anurag Ranjan, Alex Colburn, Yajie Yan, Liang Yang, Fangchang Ma, and Baptiste Angles. Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023. URLhttps://arxiv.org/abs/2304.01480

work page arXiv 2023
[16]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024

David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024. URL https: //arxiv.org/abs/2312.12337

work page arXiv 2024
[17]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024. URLhttps://arxiv.org/abs/2403.14627

work page arXiv 2024
[18]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025. URLhttps://arxiv.org/abs/2505.23716

work page arXiv 2025
[19]

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models, 2024. URLhttps://arxiv.org/abs/2404.07191

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023. URLhttps://arxiv.org/abs/2306.16928

work page arXiv 2023
[21]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image, 2024. URLhttps://arxiv.org/abs/2309.03453

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, So...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman

Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman. Complete gaussian splats from a single image with denoising diffusion models, 2025. URLhttps://arxiv.org/abs/2508.21542

work page arXiv 2025
[24]

Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025

Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025. URLhttps://arxiv.org/abs/2510.23306

work page arXiv 2025
[25]

Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation, 2025. URLhttps://arxiv.org/abs/2512.14692

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV , USA, June 2016. IEEE. doi: 10.1109/cvpr.2016.445. 11

work page doi:10.1109/cvpr.2016.445 2016
[27]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020. URL https://arxiv.org/abs/2003.08934

work page arXiv 2020
[28]

Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026

Rafał Tobiasz, Grzegorz Wilczy´nski, Marcin Mazur, Sławomir Tadeja, Weronika Smolak- Dy˙zewska, and Przemysław Spurek. Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026. URLhttps://arxiv.org/abs/2502.07754

work page arXiv 2026
[29]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state, 2025. URL https://arxiv.org/abs/ 2501.12387

work page arXiv 2025
[30]

Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021

Aljaž Božiˇc, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021. URL https://arxiv.org/ abs/2107.02191

work page arXiv 2021
[31]

V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021

Noah Stier, Alexander Rich, Pradeep Sen, and Tobias Höllerer. V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021. URL https://arxiv.org/abs/2112.00236

work page arXiv 2021
[32]

Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023

Huiyu Gao, Wei Mao, and Miaomiao Liu. Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023. URLhttps://arxiv.org/abs/2304.10687

work page arXiv 2023
[33]

Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024

Youngju Na, Woo Jae Kim, Kyu Beom Han, Suhyeon Ha, and Sung eui Yoon. Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024. URLhttps://arxiv.org/abs/2403.05086

work page arXiv 2024
[34]

Simplerecon: 3d reconstruction without 3d convolutions, 2022

Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions, 2022. URL https://arxiv. org/abs/2208.14743

work page arXiv 2022
[35]

Depthsplat: Connecting gaussian splatting and depth, 2025

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth, 2025. URL https: //arxiv.org/abs/2410.13862

work page arXiv 2025
[36]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,
[37]

URLhttps://arxiv.org/abs/2410.24207

work page arXiv
[38]

Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025. URL https: //arxiv.org/abs/2410.22128

work page arXiv 2025
[39]

Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025. URL https://arxiv.org/abs/ 2511.07321

work page arXiv 2025
[40]

Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025

Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025. URL https://arxiv. org/abs/2503.22986

work page arXiv 2025
[41]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models, 2024. URLhttps://arxiv.org/abs/2405.10314

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Srinivasan, Dor Verbin, Jonathan T

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023. URL https://arxiv.org/abs/2312.02981

work page arXiv 2023
[43]

Multi-view reconstruction via sfm-guided monocular depth estimation,

Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, and Hujun Bao. Multi-view reconstruction via sfm-guided monocular depth estimation,
[44]

URLhttps://arxiv.org/abs/2503.14483. 12

work page arXiv
[45]

Depthcrafter: Generating consistent long depth sequences for open-world videos,

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos,
[46]

URLhttps://arxiv.org/abs/2409.02095

work page arXiv
[47]

Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, and Ying Shan. Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025. URLhttps://arxiv.org/abs/2504.01016

work page arXiv 2025
[48]

Reconx: Reconstruct any scene from sparse views with video diffusion model,

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model,
[49]

URLhttps://arxiv.org/abs/2408.16767

work page arXiv
[50]

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis, 2024. URLhttps://arxiv.org/abs/2409.02048

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, and Hongbin Zha. Mv-sam3d: Adaptive multi-view fusion for layout-aware 3d generation, 2026. URL https: //arxiv.org/abs/2603.11633

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, and Shi-Min Hu. Pixal3d: Pixel-aligned 3d generation from images, 2026. URL https://arxiv.org/abs/2605.10922

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Scenewiz3d: Towards text-guided 3d scene composition, 2023

Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition, 2023. URLhttps://arxiv.org/abs/2312.08885

work page arXiv 2023
[54]

Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024

Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024. URLhttps://arxiv.org/abs/2402.07207

work page arXiv 2024
[55]

Discene: Object decoupling and interaction modeling for complex scene generation

Xiao-Lei Li, Haodong Li, Hao-Xiang Chen, Tai-Jiang Mu, and Shi-Min Hu. Discene: Object decoupling and interaction modeling for complex scene generation. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400711312. doi: 10.1145/3680528.3687589. URL https://doi.org/10.1145/ 3680528.3687589

work page doi:10.1145/3680528.3687589 2024
[56]

Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024. URL https: //arxiv.org/abs/2403.12409

work page arXiv 2024
[57]

Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025

Haonan Han, Rui Yang, Huan Liao, Jiankai Xing, Zunnan Xu, Xiaoming Yu, Junwei Zha, Xiu Li, and Wanhua Li. Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025. URLhttps://arxiv.org/abs/2405.18525

work page arXiv 2025
[58]

Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025. URLhttps://arxiv.org/abs/2502.12894

work page arXiv 2025
[59]

Dreamanywhere: Object-centric panoramic 3d scene generation, 2025

Edoardo Alberto Dominici, Jozef Hladky, Floor Verhoeven, Lukas Radl, Thomas Deixelberger, Stefan Ainetter, Philipp Drescher, Stefan Hauswiesner, Arno Coomans, Giacomo Nazzaro, Konstantinos Vardis, and Markus Steinberger. Dreamanywhere: Object-centric panoramic 3d scene generation, 2025. URLhttps://arxiv.org/abs/2506.20367

work page arXiv 2025
[60]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

2021
[63]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[64]

Multidiffusion: Fusing diffusion paths for controlled image generation, 2023

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URLhttps://arxiv.org/abs/2302.08113

work page arXiv 2023
[65]

Sage: Scalable agentic 3d scene generation for embodied ai, 2026

Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, and Fangyin Wei. Sage: Scalable agentic 3d scene generation for embodied ai, 2026. URLhttps://arxiv.org/abs/2602.10116

work page arXiv 2026
[66]

Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

3d-front: 3d furnished rooms with layouts and semantics.arXiv preprint arXiv:2011.09127, 2020

Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3d-front: 3d furnished rooms with layouts and semantics.arXiv preprint arXiv:2011.09127, 2020

work page arXiv 2011
[68]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the International Conference on Computer Vision (ICCV), 2023

2023
[69]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https: //arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019
[70]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https:// arxiv.org/abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022
[71]

The unreason- able effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[72]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K. Weinberger, editors,Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ c399862d3b9d6...

2012
[73]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[74]

Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans

Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021

2021
[75]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 14 A Experimental Setup Implementation Details.We build on Trellis.2 [ 24] at resolution 512. Consequently, for the occupancy g...

2017

[1] [1]

Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction, 2023. URL https://arxiv.org/abs/2306.03092

work page arXiv 2023

[2] [2]

Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,

[3] [3]

URLhttps://arxiv.org/abs/2106.10689

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

V olume rendering of neural implicit surfaces, 2021

Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. V olume rendering of neural implicit surfaces, 2021. URLhttps://arxiv.org/abs/2106.12052

work page arXiv 2021

[5] [5]

Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022

Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction, 2022. URL https://arxiv.org/abs/2206.00665

work page arXiv 2022

[6] [6]

Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025

Hanlin Chen, Chen Li, Yunsong Wang, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance, 2025. URL https://arxiv.org/abs/ 2312.00846

work page arXiv 2025

[7] [7]

3d gaussian splatting for real-time radiance field rendering, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering, 2023. URL https://arxiv.org/abs/2308. 04079

2023

[8] [8]

2d gaussian splatting for geometrically accurate radiance fields, 2025

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields, 2025. URL https://arxiv.org/abs/ 2403.17888

work page arXiv 2025

[9] [9]

Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025

Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction, 2025. URLhttps://arxiv.org/abs/2406.06521

work page arXiv 2025

[10] [10]

Dust3r: Geometric 3d vision made easy, 2024

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URLhttps://arxiv.org/abs/2312.14132

work page arXiv 2024

[11] [11]

Grounding image matching in 3d with mast3r, 2024

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URLhttps://arxiv.org/abs/2406.09756

work page arXiv 2024

[12] [12]

Vggt: Visual geometry grounded transformer, 2025

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025. URL https://arxiv.org/ abs/2503.11651. 10

work page arXiv 2025

[13] [13]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views, 2025. URL https://arxiv.org/abs/2511.10647

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021

Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real- time coherent 3d reconstruction from monocular video, 2021. URL https://arxiv.org/ abs/2104.00681

work page arXiv 2021

[15] [15]

Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023

Noah Stier, Anurag Ranjan, Alex Colburn, Yajie Yan, Liang Yang, Fangchang Ma, and Baptiste Angles. Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction, 2023. URLhttps://arxiv.org/abs/2304.01480

work page arXiv 2023

[16] [16]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024

David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024. URL https: //arxiv.org/abs/2312.12337

work page arXiv 2024

[17] [17]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024. URLhttps://arxiv.org/abs/2403.14627

work page arXiv 2024

[18] [18]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025. URLhttps://arxiv.org/abs/2505.23716

work page arXiv 2025

[19] [19]

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models, 2024. URLhttps://arxiv.org/abs/2404.07191

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023. URLhttps://arxiv.org/abs/2306.16928

work page arXiv 2023

[21] [21]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image, 2024. URLhttps://arxiv.org/abs/2309.03453

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, So...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman

Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, and Michael Firman. Complete gaussian splats from a single image with denoising diffusion models, 2025. URLhttps://arxiv.org/abs/2508.21542

work page arXiv 2025

[24] [24]

Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025

Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation, 2025. URLhttps://arxiv.org/abs/2510.23306

work page arXiv 2025

[25] [25]

Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation, 2025. URLhttps://arxiv.org/abs/2512.14692

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV , USA, June 2016. IEEE. doi: 10.1109/cvpr.2016.445. 11

work page doi:10.1109/cvpr.2016.445 2016

[27] [27]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis, 2020. URL https://arxiv.org/abs/2003.08934

work page arXiv 2020

[28] [28]

Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026

Rafał Tobiasz, Grzegorz Wilczy´nski, Marcin Mazur, Sławomir Tadeja, Weronika Smolak- Dy˙zewska, and Przemysław Spurek. Meshsplats: Mesh-based rendering with gaussian splatting initialization, 2026. URLhttps://arxiv.org/abs/2502.07754

work page arXiv 2026

[29] [29]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state, 2025. URL https://arxiv.org/abs/ 2501.12387

work page arXiv 2025

[30] [30]

Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021

Aljaž Božiˇc, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers, 2021. URL https://arxiv.org/ abs/2107.02191

work page arXiv 2021

[31] [31]

V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021

Noah Stier, Alexander Rich, Pradeep Sen, and Tobias Höllerer. V ortx: V olumetric 3d reconstruction with transformers for voxelwise view selection and fusion, 2021. URL https://arxiv.org/abs/2112.00236

work page arXiv 2021

[32] [32]

Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023

Huiyu Gao, Wei Mao, and Miaomiao Liu. Visfusion: Visibility-aware online 3d scene recon- struction from videos, 2023. URLhttps://arxiv.org/abs/2304.10687

work page arXiv 2023

[33] [33]

Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024

Youngju Na, Woo Jae Kim, Kyu Beom Han, Suhyeon Ha, and Sung eui Yoon. Uforecon: Generalizable sparse-view surface reconstruction from arbitrary and unfavorable sets, 2024. URLhttps://arxiv.org/abs/2403.05086

work page arXiv 2024

[34] [34]

Simplerecon: 3d reconstruction without 3d convolutions, 2022

Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions, 2022. URL https://arxiv. org/abs/2208.14743

work page arXiv 2022

[35] [35]

Depthsplat: Connecting gaussian splatting and depth, 2025

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth, 2025. URL https: //arxiv.org/abs/2410.13862

work page arXiv 2025

[36] [36]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

[37] [37]

URLhttps://arxiv.org/abs/2410.24207

work page arXiv

[38] [38]

Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025. URL https: //arxiv.org/abs/2410.22128

work page arXiv 2025

[39] [39]

Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025. URL https://arxiv.org/abs/ 2511.07321

work page arXiv 2025

[40] [40]

Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025

Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat++: Generalizable 3d gaussian splatting for efficient indoor scene reconstruction, 2025. URL https://arxiv. org/abs/2503.22986

work page arXiv 2025

[41] [41]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models, 2024. URLhttps://arxiv.org/abs/2405.10314

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Srinivasan, Dor Verbin, Jonathan T

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023. URL https://arxiv.org/abs/2312.02981

work page arXiv 2023

[43] [43]

Multi-view reconstruction via sfm-guided monocular depth estimation,

Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, and Hujun Bao. Multi-view reconstruction via sfm-guided monocular depth estimation,

[44] [44]

URLhttps://arxiv.org/abs/2503.14483. 12

work page arXiv

[45] [45]

Depthcrafter: Generating consistent long depth sequences for open-world videos,

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos,

[46] [46]

URLhttps://arxiv.org/abs/2409.02095

work page arXiv

[47] [47]

Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, and Ying Shan. Geome- trycrafter: Consistent geometry estimation for open-world videos with diffusion priors, 2025. URLhttps://arxiv.org/abs/2504.01016

work page arXiv 2025

[48] [48]

Reconx: Reconstruct any scene from sparse views with video diffusion model,

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model,

[49] [49]

URLhttps://arxiv.org/abs/2408.16767

work page arXiv

[50] [50]

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis, 2024. URLhttps://arxiv.org/abs/2409.02048

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, and Hongbin Zha. Mv-sam3d: Adaptive multi-view fusion for layout-aware 3d generation, 2026. URL https: //arxiv.org/abs/2603.11633

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [52]

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, and Shi-Min Hu. Pixal3d: Pixel-aligned 3d generation from images, 2026. URL https://arxiv.org/abs/2605.10922

work page internal anchor Pith review Pith/arXiv arXiv 2026

[53] [53]

Scenewiz3d: Towards text-guided 3d scene composition, 2023

Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition, 2023. URLhttps://arxiv.org/abs/2312.08885

work page arXiv 2023

[54] [54]

Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024

Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting, 2024. URLhttps://arxiv.org/abs/2402.07207

work page arXiv 2024

[55] [55]

Discene: Object decoupling and interaction modeling for complex scene generation

Xiao-Lei Li, Haodong Li, Hao-Xiang Chen, Tai-Jiang Mu, and Shi-Min Hu. Discene: Object decoupling and interaction modeling for complex scene generation. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400711312. doi: 10.1145/3680528.3687589. URL https://doi.org/10.1145/ 3680528.3687589

work page doi:10.1145/3680528.3687589 2024

[56] [56]

Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance, 2024. URL https: //arxiv.org/abs/2403.12409

work page arXiv 2024

[57] [57]

Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025

Haonan Han, Rui Yang, Huan Liao, Jiankai Xing, Zunnan Xu, Xiaoming Yu, Junwei Zha, Xiu Li, and Wanhua Li. Reparo: Compositional 3d assets generation with differentiable 3d layout alignment, 2025. URLhttps://arxiv.org/abs/2405.18525

work page arXiv 2025

[58] [58]

Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image, 2025. URLhttps://arxiv.org/abs/2502.12894

work page arXiv 2025

[59] [59]

Dreamanywhere: Object-centric panoramic 3d scene generation, 2025

Edoardo Alberto Dominici, Jozef Hladky, Floor Verhoeven, Lukas Radl, Thomas Deixelberger, Stefan Ainetter, Philipp Drescher, Stefan Hauswiesner, Arno Coomans, Giacomo Nazzaro, Konstantinos Vardis, and Markus Steinberger. Dreamanywhere: Object-centric panoramic 3d scene generation, 2025. URLhttps://arxiv.org/abs/2506.20367

work page arXiv 2025

[60] [60]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

2021

[63] [63]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021

[64] [64]

Multidiffusion: Fusing diffusion paths for controlled image generation, 2023

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URLhttps://arxiv.org/abs/2302.08113

work page arXiv 2023

[65] [65]

Sage: Scalable agentic 3d scene generation for embodied ai, 2026

Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, and Fangyin Wei. Sage: Scalable agentic 3d scene generation for embodied ai, 2026. URLhttps://arxiv.org/abs/2602.10116

work page arXiv 2026

[66] [66]

Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [67]

3d-front: 3d furnished rooms with layouts and semantics.arXiv preprint arXiv:2011.09127, 2020

Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3d-front: 3d furnished rooms with layouts and semantics.arXiv preprint arXiv:2011.09127, 2020

work page arXiv 2011

[68] [68]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the International Conference on Computer Vision (ICCV), 2023

2023

[69] [69]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https: //arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019

[70] [70]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https:// arxiv.org/abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022

[71] [71]

The unreason- able effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018

[72] [72]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K. Weinberger, editors,Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ c399862d3b9d6...

2012

[73] [73]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[74] [74]

Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans

Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021

2021

[75] [75]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 14 A Experimental Setup Implementation Details.We build on Trellis.2 [ 24] at resolution 512. Consequently, for the occupancy g...

2017