pith. machine review for the scientific record.

arxiv: 2604.20093 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

FurnSet: Exploiting Repeats for 3D Scene Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene reconstruction · single-view reconstruction · repeated instances · self-attention · object grouping · layout optimization · furniture scenes

The pith

Exploiting repeated object instances improves single-view 3D scene reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that single-view 3D scene reconstruction benefits from explicitly finding and using repeated objects, such as multiple instances of the same furniture piece. A reader would care because such repetition is common in real scenes and can supply extra information for reconstructing hidden parts of objects. The approach uses per-object classification (CLS) tokens and a set-aware self-attention mechanism to group matching instances and combine their observations for better joint reconstruction. It also conditions generation on both the scene and the individual objects, then optimizes the layout with losses on point clouds and their projections. If the claim holds, this method produces more accurate 3D scenes from one image on datasets with repeated items.

Core claim

The framework introduces per-object CLS tokens and a set-aware self-attention mechanism that groups identical instances and aggregates complementary observations across them, enabling joint reconstruction of repeated objects. Combined with scene-level and object-level conditioning, and with layout optimization that applies 3D and 2D projection losses to object point clouds, this leads to improved scene reconstruction quality.

What carries the argument

per-object CLS tokens and set-aware self-attention mechanism for grouping identical instances and aggregating observations
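
A minimal sketch of how such a mechanism could look, assuming a transformer backbone. The module name, the cosine-similarity threshold, and the masking scheme below are illustrative assumptions rather than the paper's implementation; the point is only to make the grouping-then-aggregation idea concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAwareAttention(nn.Module):
    """Hedged sketch: attention over per-object token sequences, each
    prefixed with a CLS token that summarizes the object. Objects whose
    CLS embeddings are similar are treated as one repeated-instance set
    and allowed to exchange information."""

    def __init__(self, dim: int, num_heads: int = 8, sim_threshold: float = 0.8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sim_threshold = sim_threshold  # assumed grouping cutoff, not from the paper

    def forward(self, obj_tokens: torch.Tensor) -> torch.Tensor:
        # obj_tokens: (num_objects, seq_len, dim); token 0 of each sequence is its CLS.
        n, s, d = obj_tokens.shape
        cls = F.normalize(obj_tokens[:, 0, :], dim=-1)       # (n, d) unit CLS embeddings
        same_set = (cls @ cls.T) > self.sim_threshold        # (n, n) True = same instance set
        # Flatten every object's tokens into one sequence, then mask out
        # attention between tokens of objects that were not grouped together.
        tokens = obj_tokens.reshape(1, n * s, d)
        allowed = same_set.repeat_interleave(s, dim=0).repeat_interleave(s, dim=1)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=~allowed)
        return out.reshape(n, s, d)
```

The grouping mask restricts attention so tokens mix only across objects whose CLS summaries match; that restriction is what would let complementary observations of repeated instances be aggregated without contaminating unrelated objects.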

If this is right

  • Joint reconstruction from grouped instances fills in missing geometric details from complementary views.
  • Scene alignment is enhanced by optimizing layouts with point cloud and projection consistency losses.
  • Reconstruction performs better in scenes containing repeated furniture objects as shown on 3D-Future and 3D-Front.
  • Object geometries are more complete and layouts more consistent than independent per-object methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This grouping strategy could apply to other repeated elements in 3D scenes beyond furniture.
  • Future work might test the method in outdoor scenes with repeated structures like trees or windows.
  • Hybrid systems could use this when repeats are detected and switch to standard methods otherwise.

Load-bearing premise

Real-world scenes contain sufficient reliably identifiable repeated instances that the per-object CLS tokens and set-aware self-attention can group correctly without introducing grouping errors.

What would settle it

The central claim would be falsified by observing, on a test scene with clearly repeated objects, that the method produces lower-quality reconstructions or incorrect groupings compared with baselines that treat objects independently.

Figures

Figures reproduced from arXiv: 2604.20093 by Hongzhou Yang, Paul Dobre, Xin Wang.

Figure 1: Overview of the FurnSet Framework. Our framework takes a single scene image and object segmentations as conditioning for object generation. During structured voxel generation, CLS tokens are concatenated with object tokens to identify repeated instances, which are jointly reconstructed through set-aware self-attention. The object cross-attention and scene cross-attention modules aggregate object-level and …

Figure 2: Repeated object set reconstruction under occlusion.

Figure 3: Qualitative comparison with scene generation methods. The top three scenes are …
Original abstract

Single-view 3D scene reconstruction involves inferring both object geometry and spatial layout. Existing methods typically reconstruct objects independently or rely on implicit scene context, failing to exploit the repeated instances commonly present in real-world scenes. We propose FurnSet, a framework that explicitly identifies and leverages repeated object instances to improve reconstruction. Our method introduces per-object CLS tokens and a set-aware self-attention mechanism that groups identical instances and aggregates complementary observations across them, enabling joint reconstruction. We further combine scene-level and object-level conditioning to guide object reconstruction, followed by layout optimization using object point clouds with 3D and 2D projection losses for scene alignment. Experiments on 3D-Future and 3D-Front demonstrate improved scene reconstruction quality, highlighting the effectiveness of exploiting repetition for robust 3D scene reconstruction.
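
The layout-optimization step described in the abstract lends itself to a compact sketch. What follows is a hedged reading, assuming per-object translations, a symmetric 3D Chamfer term, and a pinhole 2D reprojection term; the helper names, the weight w2d, and the translation-only parameterization are assumptions for illustration, not the paper's procedure.

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def project(points: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # Pinhole projection of camera-frame points (N, 3) with intrinsics K (3, 3);
    # assumes points sit in front of the camera (positive depth).
    uv = (K @ points.T).T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def layout_loss(obj_points, translations, target_points, target_uv, K, w2d=0.1):
    """3D Chamfer term plus a weighted 2D reprojection term, summed over objects.
    Only per-object translations are optimized here for brevity; a full pose
    (rotation, scale) would be handled analogously."""
    loss = torch.zeros(())
    for pts, t, tgt3d, tgt2d in zip(obj_points, translations, target_points, target_uv):
        posed = pts + t                                    # place the object in the scene
        loss = loss + chamfer(posed, tgt3d)                # 3D point-cloud consistency
        loss = loss + w2d * ((project(posed, K) - tgt2d) ** 2).mean()  # 2D projection consistency
    return loss
```

In use, each translation would be a leaf tensor with requires_grad=True, stepped by a standard optimizer such as torch.optim.Adam until the combined loss converges.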

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces FurnSet, a framework for single-view 3D scene reconstruction that explicitly identifies repeated object instances in indoor scenes. It proposes per-object CLS tokens combined with a set-aware self-attention mechanism to group identical instances and aggregate complementary observations across them for joint reconstruction. The approach further incorporates scene-level and object-level conditioning to guide reconstruction, followed by layout optimization that uses object point clouds together with 3D and 2D projection losses. Experiments on the 3D-Future and 3D-Front datasets are reported to show improved scene reconstruction quality over prior methods.

Significance. If the grouping and aggregation mechanism proves reliable, the work could meaningfully advance single-view scene reconstruction by exploiting a structural property (repetitions) that is common in real-world indoor environments but ignored by most existing pipelines. The combination of set-aware attention with conditioning and explicit layout optimization is a coherent design choice, and evaluation on standard datasets (3D-Future, 3D-Front) would allow direct comparison with prior art. The absence of quantitative metrics, ablations, or error analysis in the current text, however, prevents a full assessment of whether the claimed gains are attributable to repetition exploitation.

major comments (1)
  1. §3.2 (set-aware self-attention): The central claim that per-object CLS tokens plus set-aware self-attention correctly group identical instances so that complementary observations can be aggregated is not supported by any validation. In single-view inputs, intra-class feature similarity can produce embeddings that are close without corresponding to identical objects; erroneous groupings would then average incompatible geometry or texture signals. The subsequent scene/object conditioning and layout optimization steps do not retroactively correct such mistakes, so any reported improvement on 3D-Future/3D-Front could be driven by conditioning alone rather than repetition exploitation. The manuscript must supply either qualitative grouping visualizations or a quantitative grouping-accuracy metric to establish that this component functions as required.
minor comments (2)
  1. The abstract states that experiments demonstrate improved quality but provides no numerical results, baseline comparisons, or ablation tables; these must be added with standard metrics (e.g., object IoU, scene Chamfer distance) and controls that isolate the repetition component (see the sketch after this list).
  2. Notation for the per-object CLS tokens and the set-aware attention operation should be formalized with equations to clarify how grouping and aggregation are implemented.
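
As a concrete instance of the metrics asked for in minor comment 1, here is a hedged sketch of per-object voxel IoU; the boolean occupancy-grid representation and its resolution are assumptions. A scene-level Chamfer distance could reuse the chamfer helper sketched earlier.

```python
import torch

def voxel_iou(pred_occ: torch.Tensor, gt_occ: torch.Tensor) -> float:
    # pred_occ, gt_occ: boolean occupancy grids of identical shape,
    # e.g. (64, 64, 64); the resolution is an assumed choice.
    inter = (pred_occ & gt_occ).sum().item()
    union = (pred_occ | gt_occ).sum().item()
    return inter / union if union > 0 else 1.0
```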

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We value the feedback on the need for validation of the core grouping component in FurnSet. We address the concern below and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: §3.2 (set-aware self-attention): The central claim that per-object CLS tokens plus set-aware self-attention correctly group identical instances so that complementary observations can be aggregated is not supported by any validation. In single-view inputs, intra-class feature similarity can produce embeddings that are close without corresponding to identical objects; erroneous groupings would then average incompatible geometry or texture signals. The subsequent scene/object conditioning and layout optimization steps do not retroactively correct such mistakes, so any reported improvement on 3D-Future/3D-Front could be driven by conditioning alone rather than repetition exploitation. The manuscript must supply either qualitative grouping visualizations or a quantitative grouping-accuracy metric to establish that this component functions as required.

    Authors: We acknowledge the importance of validating the grouping mechanism to ensure that the observed improvements stem from repetition exploitation. While the manuscript describes the set-aware self-attention and its intended role in grouping identical instances via per-object CLS tokens, we agree that explicit evidence is necessary. In the revised version, we will include qualitative visualizations of the attention maps and grouped instances from the 3D-Future and 3D-Front datasets. These will illustrate cases where repeated objects are correctly identified and their features aggregated. Furthermore, we will introduce a quantitative grouping-accuracy metric, computed by measuring the precision and recall of instance grouping against available ground-truth labels in the datasets. This will be accompanied by an ablation study isolating the contribution of the set-aware attention versus the conditioning components. We believe these additions will substantiate the central claim. revision: yes
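
A minimal sketch, under assumptions, of the grouping-accuracy metric the rebuttal commits to: pairwise precision and recall of predicted groups against ground-truth repeat labels. The integer group ids per object are a stand-in for whatever instance labeling 3D-Future and 3D-Front provide.

```python
from itertools import combinations

def pairwise_grouping_pr(pred_groups, gt_groups):
    """Pairwise precision/recall of instance grouping.
    pred_groups, gt_groups: lists where index i holds object i's group id;
    a pair counts as positive when both objects share an id."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gt_groups)), 2):
        pred_same = pred_groups[i] == pred_groups[j]
        gt_same = gt_groups[i] == gt_groups[j]
        tp += pred_same and gt_same        # correctly grouped pair
        fp += pred_same and not gt_same    # spurious grouping
        fn += gt_same and not pred_same    # missed repeat
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# e.g. pairwise_grouping_pr([0, 0, 1, 2], [0, 0, 1, 1]) -> (1.0, 0.5)
```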

Circularity Check

0 steps flagged

No circularity: novel architecture with independent mechanisms evaluated on external data

full rationale

The paper proposes FurnSet as a new framework that introduces per-object CLS tokens and a set-aware self-attention mechanism to group repeated instances and aggregate observations for joint reconstruction. This is combined with scene/object conditioning and layout optimization using point clouds and projection losses. The description presents these as original contributions, with experiments on the external 3D-Future and 3D-Front datasets. No equations or steps reduce a claimed prediction or result to a fitted input or self-citation by construction. No self-citations are invoked as load-bearing for uniqueness or ansatz. The central claim of improved reconstruction via repetition exploitation rests on the described novel components rather than tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the approach appears to extend standard transformer elements without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5433 in / 994 out tokens · 30204 ms · 2026-05-10T01:26:43.403466+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

    Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In3DV, pages 616–626. IEEE, 2025

  2. [2]

    Chang, and Matthias Nießner

    Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, and Matthias Nießner. Scan2cad: Learning cad model alignment in rgb-d scans. InCVPR, pages 2614–2623, 2019

  3. [3]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  4. [4]

    SAM 3D: 3Dfy Anything in Images

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J. Liang, Alexander Sax, Hao Tang, Weiyao Wang, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

  5. [5]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017

  6. [6]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, et al. Objaverse-xl: A universe of 10m+ 3d objects. InNeurIPS, volume 36, pages 35799–35813, 2023

  7. [7]

    Obja- verse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli Vander- Bilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- verse: A universe of annotated 3d objects. InCVPR, pages 13142–13153, 2023

  8. [8]

    Bert: Pre- training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. InNAACL, pages 4171–4186, 2019

  9. [9]

    Full- part: Generating each 3d part at full resolution.arXiv preprint arXiv:2510.26140,

    Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, et al. Fullpart: Generating each 3d part at full resolution.arXiv preprint arXiv:2510.26140, 2025

  10. [10]

    From one to more: Contextual part latents for 3d generation

    Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, et al. From one to more: Contextual part latents for 3d generation. InICCV, 2025

  11. [11]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, et al. 3d-front: 3d furnished rooms with layouts and semantics. InICCV, pages 10933– 10942, 2021

  12. [12]

    3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

    Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

  13. [13]

    Diffcad: Weakly-supervised probabilistic cad model retrieval and alignment from an rgb image

    Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, and Angela Dai. Diffcad: Weakly-supervised probabilistic cad model retrieval and alignment from an rgb image. ACM Transactions on Graphics, 43(4):1–15, 2024. 12DOBRE ET AL. : FURNSET

  14. [14]

    Cat3D: Create anything in 3d with multi-view diffusion models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin- Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

  15. [15]

    Filterreg: Robust and efficient probabilistic point-set regis- tration using gaussian filter and twist parameterization

    Wei Gao and Russ Tedrake. Filterreg: Robust and efficient probabilistic point-set regis- tration using gaussian filter and twist parameterization. InCVPR, pages 11095–11104, 2019

  16. [16]

    Roca: Robust cad model retrieval and alignment from a single image

    Can Gümeli, Angela Dai, and Matthias Nießner. Roca: Robust cad model retrieval and alignment from a single image. InCVPR, pages 4022–4031, 2022

  17. [17]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  18. [18]

    Midi: Multi-instance diffu- sion for single image to 3d scene generation

    Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffu- sion for single image to 3d scene generation. InCVPR, pages 23646–23657, 2025

  19. [19]

    arXiv2506.15442(2025) 10

    Hunyuan3D Team, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material.arXiv preprint arXiv:2506.15442, 2025

  20. [20]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. InCVPR, pages 9492–9502, 2024

  21. [21]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graph- ics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graph- ics, 42(4), 2023

  22. [22]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, et al. Segment anything. InICCV, pages 4015–4026, 2023

  23. [23]

    Instant3d: Fast text-to-3d with sparse-view gen- eration and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text- to-3d with sparse-view generation and large reconstruction model.arXiv preprint arXiv:2311.06214, 2023

  24. [24]

    Triposg: High-fidelity 3d shape synthesis using large-scale rec- tified flow models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rec- tified flow models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  25. [25]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  26. [26]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.ArXiv, abs/2506.05573, 2025

    Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.arXiv preprint arXiv:2506.05573, 2025

  27. [27]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InICLR, 2023. DOBRE ET AL. : FURNSET13

  28. [28]

    Syncdreamer: Gen- erating multiview-consistent images from a single-view im- age.arXiv preprint arXiv:2309.03453, 2023

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single- view image.arXiv preprint arXiv:2309.03453, 2023

  29. [29]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In CVPR, pages 9970–9980, 2024

  30. [30]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  31. [31]

    arXiv preprint arXiv:2508.15769 (2025)

    Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass.arXiv preprint arXiv:2508.15769, 2025

  32. [32]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ra- mamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  33. [33]

    Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image

    Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jian Jun Zhang. Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. InCVPR, pages 55–64, 2020

  34. [34]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023

  35. [35]

    Convolutional occupancy networks

    Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. InECCV, pages 523–540. Springer, 2020

  36. [36]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Om- mer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  37. [37]

    Retrievalfuse: Neural 3d scene reconstruction with a database

    Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Retrievalfuse: Neural 3d scene reconstruction with a database. InICCV, pages 12568–12577, 2021

  38. [38]

    Layoutvlm: Differentiable optimization of 3d layout via vision-language models

    Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Man- ling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. InCVPR, pages 29469–29478, 2025

  39. [39]

    Recent advances in 3d object and scene generation: A survey,

    Xiang Tang, Ruotong Li, and Xiaopeng Fan. Recent advances in 3d object and scene generation: A survey.arXiv preprint arXiv:2504.11734, 2025

  40. [40]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

  41. [41]

    Qirui Wu, Denys Iliash, Daniel Ritchie, Manolis Savva, and Angel X. Chang. Diorama: Unleashing zero-shot single-view 3d indoor scene modeling. InICCV, pages 8896– 8907, 2025

  42. [42]

    Sin3dm: Learning a diffu- sion model from a single 3d textured shape.arXiv preprint arXiv:2305.15399, 2023

    Rundi Wu, Ruoshi Liu, Carl V ondrick, and Changxi Zheng. Sin3dm: Learning a diffu- sion model from a single 3d textured shape.arXiv preprint arXiv:2305.15399, 2023. 14DOBRE ET AL. : FURNSET

  43. [43]

    Direct3d: Scalable image-to-3d generation via 3d latent diffusion trans- former

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion trans- former. InNeurIPS, volume 37, pages 121859–121881, 2024

  44. [44]

    Amodal3r: Amodal 3d reconstruction from occluded 2d images

    Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. InICCV, pages 9181– 9193, 2025

  45. [45]

    Structured 3d latents for scalable and ver- satile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and ver- satile 3d generation. InCVPR, pages 21469–21480, 2025

  46. [46]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. InNeurIPS, 2024

  47. [47]

    Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Transactions on Graphics, 44(4), 2025

    Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Transactions on Graphics, 44(4), 2025

  48. [48]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InICCV, pages 12–22, 2023

  49. [49]

    Metascenes: Towards automated replica creation for real-world 3d scans

    Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, et al. Metascenes: Towards automated replica creation for real-world 3d scans. In CVPR, pages 1667–1679, 2025

  50. [50]

    Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion

    Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion. InNeurIPS, volume 36, pages 30026–30038, 2023

  51. [51]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Trans- actions on Graphics, 42(4), 2023

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Trans- actions on Graphics, 42(4), 2023

  52. [52]

    arXiv preprint arXiv:2507.14501 , year=

    Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao- Xiao Long, et al. Advances in feed-forward 3d reconstruction and view synthesis: A survey.arXiv preprint arXiv:2507.14501, 2025

  53. [53]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics, 43(4):1–20, 2024

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics, 43(4):1–20, 2024

  54. [54]

    Depr: Depth-guided single-view scene reconstruction with instance- level diffusion

    Qingcheng Zhao, Xiang Zhang, Haiyang Xu, Zeyuan Chen, Jianwen Xie, Yuan Gao, and Zhuowen Tu. Depr: Depth-guided single-view scene reconstruction with instance- level diffusion. InICCV, pages 5722–5733, 2025

  55. [55]

    Amodalgen3d: Generative amodal 3d object recon- struction from sparse unposed views.arXiv preprint arXiv:2511.21945, 2025

    Junwei Zhou and Yu-Wing Tai. Amodalgen3d: Generative amodal 3d object recon- struction from sparse unposed views.arXiv preprint arXiv:2511.21945, 2025