DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

arxiv: 2605.16807 · v1 · pith:X43KUQV7 · submitted 2026-05-16 · 💻 cs.CV

Yuhan Ping, Yuan Liu, Xiaoxiao Long, Peng Wang, Junhui Hou, Jianyi Zheng, Jia Pan, Xin Li, Cheng Lin

Pith reviewed 2026-05-19 21:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D reconstruction · single view · diffusion models · object decomposition · scene mesh · novel view synthesis · differentiable rendering

The pith

DecoRec reconstructs a 3D scene from a single-view image by reconstructing each object separately with a diffusion model, then merging and refining the results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DecoRec aims to create accurate 3D scene meshes from a single 2D photo by breaking the scene into separate objects. Each object is reconstructed using diffusion models trained for single-view object reconstruction. These pieces are then merged and refined with differentiable rendering and diffusion guidance to fix geometry and appearance issues. This matters because, according to the abstract, existing single-view scene methods rely on object retrieval or on regressing coarse voxels or surfaces, and high-quality large-scale scene-level datasets are lacking. The object-wise approach promises better fidelity for applications such as room interior design.

Core claim

The central claim is that by reconstructing each object in the scene separately with diffusion-based single-view methods and then merging them through a refinement pipeline that uses differentiable rendering and diffusion guidance, high-quality 3D scene reconstruction and novel view synthesis become possible from just one image.

What carries the argument

The decomposition of the scene into individual objects reconstructed via diffusion models, followed by a refinement pipeline for merging using differentiable rendering and diffusion-guided adjustments.
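The two-stage structure described above can be sketched as a small skeleton. This is a minimal illustration, not the authors' implementation: `reconstruct_scene`, `reconstruct_object`, and `merge_and_refine` are hypothetical names, and the two callables stand in for a diffusion-based single-view object model and for the refinement stage, neither of which is specified on this page.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")  # whatever image type the caller uses
Mask = TypeVar("Mask")    # one mask per object in the image
Mesh = TypeVar("Mesh")    # whatever mesh type the object model returns


def reconstruct_scene(
    image: Image,
    masks: Sequence[Mask],
    reconstruct_object: Callable[[Image, Mask], Mesh],
    merge_and_refine: Callable[[Image, List[Mesh]], List[Mesh]],
) -> List[Mesh]:
    """Object-wise reconstruction followed by one joint refinement pass."""
    # Stage 1: each masked object is lifted to 3D independently
    # (hypothetical stand-in for a diffusion-based object model).
    coarse = [reconstruct_object(image, mask) for mask in masks]
    # Stage 2: the coarse meshes are merged and refined against the input image
    # (hypothetical stand-in for the rendering-based refinement stage).
    return merge_and_refine(image, coarse)
```

Keeping both stages as injected callables mirrors the modularity the page emphasizes: the object model can be swapped without touching the merge step.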

Load-bearing premise

That the refinement pipeline can reliably resolve any geometric or appearance inconsistencies introduced when merging the separately reconstructed objects.

What would settle it

Observing a case where individually accurate object reconstructions lead to a merged scene with uncorrectable errors in surface alignment or texture continuity despite the refinement steps.

Figures

Figures reproduced from arXiv: 2605.16807 by Cheng Lin, Jianyi Zheng, Jia Pan, Junhui Hou, Peng Wang, Xiaoxiao Long, Xin Li, Yuan Liu, Yuhan Ping.

**Figure 2.** Refinement comparison. Single-view reconstruction method cannot …

**Figure 3.** Overview of our pipeline. Given an input image and object masks, our method first performs a coarse decomposition and reconstruction for both …

**Figure 4.** Background reconstruction. We reconstruct a complete background …

**Figure 6.** Qualitative comparison with other methods on the real-world dataset. From top to bottom, we show one novel-view rendering and the corresponding …

**Figure 7.** Qualitative comparison with other methods on the synthetic dataset. From top to bottom, we show two novel views and the corresponding 3D …

**Figure 8.** Qualitative comparison with Gen3DSR on stylized inputs.

**Figure 9.** Ablation studies on different loss settings. Without any designed losses, it will cause issues like texture blurring, structure distortion, and black borders.

**Figure 10.** Flexibility of scene editing. Our method enables decomposed 3D lifting of 2D images and thus, users can perform different editing operations …

**Figure 11.** Additional results with implementation of Hunyuan3D-2 and …

**Figure 12.** More comparisons between coarse and refinement stages.

**Figure 13.** Result gallery of qualitative comparison with other methods.

**Figure 14.** Result gallery of qualitative comparison with other methods.

**Figure 15.** Result gallery of qualitative comparison with other methods.

**Figure 16.** Result of our method on more types of scenes, including kitchen, …

**Figure 17.** Comparison with Gen3DSR by giving the same mask as our method.

**Figure 18.** Comparisons of background reconstruction results. The middle and …

Original abstract

In this paper, we introduce DecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DecoRec splits single-view scene reconstruction into per-object diffusion steps followed by a differentiable merge, which is a sensible engineering move but leaves the consistency of the final geometry open to question.

The main point with DecoRec is that it reconstructs each object in a scene separately using existing diffusion models for single-view 3D, then runs a refinement stage that combines differentiable rendering with further diffusion guidance to produce a coherent mesh. This sidesteps the shortage of large scene-level training sets by working modularly instead of generating everything at once.

The approach is straightforward and directly targets practical gaps in current retrieval or coarse voxel methods. It shows promise for tasks like room design where both shape fidelity and novel-view rendering matter. The authors make a reasonable case that the post-merge refinement improves appearance and geometry over baselines, at least in the examples they present. That modular split is the clearest addition here, even if the individual pieces draw from prior diffusion and rendering work.

The softer spot is the merging step itself. Single-view inputs leave depth and contact relations under-determined, so the refinement has to resolve inter-object geometry without explicit 3D constraints like penetration penalties. If the guidance stays mostly appearance-driven, it could mask rather than correct drift at boundaries or occluded regions. The abstract gives no numbers or targeted failure-case tests, so the full paper needs to demonstrate that the pipeline actually delivers measurable structural gains rather than just smoother visuals.

This is the kind of system paper that would interest people building 3D tools for interior modeling or AR. Readers who want concrete pipelines that combine off-the-shelf object models with a fusion stage will find usable ideas, provided the experiments back the claims. It is worth sending for peer review so referees can examine the implementation details and the strength of the quantitative comparisons.

Referee Report

2 major / 2 minor

Summary. The paper introduces DecoRec, a system for single-view 3D scene reconstruction that decomposes the input into individual objects, reconstructs each using existing diffusion-based single-view object methods, estimates poses to place them, and then applies a refinement pipeline with differentiable rendering and diffusion guidance to produce a coherent scene mesh. The central claim is that this yields high-quality geometry and novel-view synthesis superior to direct scene-level approaches, without requiring large scene datasets.

Significance. If validated, the decomposed approach would be a useful contribution to single-view scene reconstruction by leveraging strong object-level priors and a post-merging refinement stage. Credit is given for the modular design that reuses external diffusion models and for proposing an independent merging/refinement stage rather than end-to-end scene generation. The emphasis on both geometry and appearance consistency via differentiable rendering is a reasonable direction for addressing under-constrained single-view problems.

major comments (2)

[Results / Experiments] The abstract states that results demonstrate high-quality geometry and novel-view synthesis, yet the supplied text contains no quantitative metrics (e.g., Chamfer distance, IoU, or PSNR), ablation studies, or error analysis. This is load-bearing for the central claim; the results section must include controlled comparisons to baselines and targeted failure-case analysis (occlusions, contacts) to substantiate superiority.
[Method / Refinement pipeline] The refinement pipeline (described after object reconstruction) relies on differentiable rendering plus diffusion guidance to resolve inter-object inconsistencies. Without explicit 3D consistency terms (e.g., contact or penetration losses) or ablation on scenes with touching/occluding objects, it is unclear whether residual geometric drift or appearance seams are reliably eliminated; this directly affects the coherence guarantee.
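The metrics named in the first major comment can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the paper's evaluation protocol: Chamfer distance is taken here as the sum of the two mean nearest-neighbour distances (other conventions use squared distances or an average), IoU is computed on sets of occupied voxel indices, and PSNR operates on flattened images with values in `[0, peak]`. All function names are illustrative.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection over union of two occupied-voxel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def psnr(rendered: Sequence[float], reference: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((x - y) ** 2 for x, y in zip(rendered, reference)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the usual choice for dense point clouds.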

minor comments (2)

[Method] Notation for the merging stage (pose estimation, rigid placement) should be formalized with equations or a clear algorithmic outline to improve reproducibility.
[Abstract] The abstract would benefit from naming the specific object-reconstruction diffusion models used and the scene categories evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and describe the changes planned for the revised manuscript.

Referee: [Results / Experiments] The abstract states that results demonstrate high-quality geometry and novel-view synthesis, yet the supplied text contains no quantitative metrics (e.g., Chamfer distance, IoU, or PSNR), ablation studies, or error analysis. This is load-bearing for the central claim; the results section must include controlled comparisons to baselines and targeted failure-case analysis (occlusions, contacts) to substantiate superiority.

Authors: We agree that the absence of quantitative metrics weakens the central claim. The current manuscript emphasizes qualitative results partly because of the scarcity of standardized large-scale scene benchmarks for single-view reconstruction. In the revision we will add controlled quantitative comparisons using Chamfer distance and IoU for geometry as well as PSNR and SSIM for novel-view synthesis, together with ablations and a dedicated failure-case analysis focused on occlusions and object contacts.

Revision: yes

Referee: [Method / Refinement pipeline] The refinement pipeline (described after object reconstruction) relies on differentiable rendering plus diffusion guidance to resolve inter-object inconsistencies. Without explicit 3D consistency terms (e.g., contact or penetration losses) or ablation on scenes with touching/occluding objects, it is unclear whether residual geometric drift or appearance seams are reliably eliminated; this directly affects the coherence guarantee.

Authors: The refinement stage optimizes the merged scene via differentiable rendering under diffusion guidance from the input and generated novel views; the strong object-level priors embedded in the diffusion model are intended to reduce inter-object drift and seams without hand-crafted contact losses. We acknowledge that this mechanism would be clearer with targeted evidence. In the revision we will therefore include ablations on scenes containing touching and occluding objects, reporting both qualitative coherence and quantitative consistency metrics.

Revision: partial
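For context, an explicit penetration term of the kind the referee asks about could look like the sketch below. It is illustrative only and, per the rebuttal, not something DecoRec uses: it assumes, hypothetically, that each object is approximated by a bounding sphere given as `(center, radius)`, and `penetration_penalty` is a made-up name.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Sphere = Tuple[Tuple[float, float, float], float]  # (center, radius)


def penetration_penalty(spheres: Sequence[Sphere]) -> float:
    """Sum of squared overlap depths over all pairs of bounding spheres."""
    total = 0.0
    for (c1, r1), (c2, r2) in combinations(spheres, 2):
        # Overlap depth is positive only when the two spheres intersect.
        overlap = r1 + r2 - math.dist(c1, c2)
        total += max(0.0, overlap) ** 2
    return total
```

The penalty is zero for separated or exactly touching objects and grows quadratically with overlap depth, so it discourages interpenetration without pushing apart objects that are merely in contact. Real meshes would need a finer proxy than spheres, such as signed distance fields.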

Circularity Check

0 steps flagged

No significant circularity; method builds on external priors with independent refinement stage

The paper describes a pipeline that applies existing diffusion-based single-view object reconstruction methods to individual objects, followed by a separate refinement stage using differentiable rendering and diffusion guidance to merge them. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs are present in the abstract or described approach. The derivation is self-contained against external benchmarks (prior diffusion models) and does not rely on self-definitional steps or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, mathematical axioms, or newly postulated entities; the approach relies on existing diffusion models for single objects and standard differentiable rendering without stating additional unproven assumptions beyond the effectiveness of the proposed merging stage.

pith-pipeline@v0.9.0 · 5718 in / 1162 out tokens · 62634 ms · 2026-05-19T21:06:34.650006+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

DecoRec ... reconstruct individual objects separately ... refinement pipeline ... differentiable rendering technique and diffusion-guided refinement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 9 internal anchors

[1]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR, 2022

work page 2022
[2]

Monoscene: Monocular 3d semantic scene completion,

A.-Q. Cao and R. De Charette, “Monoscene: Monocular 3d semantic scene completion,” inCVPR, 2022

work page 2022
[3]

Corenet: Coherent 3d scene reconstruction from a single rgb image,

S. Popov, P. Bauszat, and V . Ferrari, “Corenet: Coherent 3d scene reconstruction from a single rgb image,” inECCV, 2020. MANUSCRIPT SUBMITTED TO IEEE TVCG 11

work page 2020
[4]

To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image,

Y . Nie, X. Han, S. Guo, Y . Zheng, J. Chang, and J. J. Zhang, “To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image,” inCVPR, 2020

work page 2020
[5]

Psdr-room: Single photo to scene using differentiable rendering,

K. Yan, F. Luan, M. Ha ˇsan, T. Groueix, V . Deschaintre, and S. Zhao, “Psdr-room: Single photo to scene using differentiable rendering,” in SIGGRAPH Asia 2023, 2023

work page 2023
[6]

Roca: Robust cad model retrieval and alignment from a single image,

C. G ¨umeli, A. Dai, and M. Nießner, “Roca: Robust cad model retrieval and alignment from a single image,” inCVPR, 2022

work page 2022
[7]

Patch2cad: Patchwise embedding learning for in-the-wild shape retrieval from a single image,

W. Kuo, A. Angelova, T.-Y . Lin, and A. Dai, “Patch2cad: Patchwise embedding learning for in-the-wild shape retrieval from a single image,” inICCV, 2021

work page 2021
[8]

Generalizing single- view 3d shape retrieval to occlusions and unseen objects,

Q. Wu, D. Ritchie, M. Savva, and A. X. Chang, “Generalizing single- view 3d shape retrieval to occlusions and unseen objects,”arXiv preprint arXiv:2401.00405, 2023

work page arXiv 2023
[9]

DreamFusion: Text-to-3D using 2D Diffusion

B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,”arXiv preprint arXiv:2209.14988, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Y . Liu, C. Lin, Z. Zeng, X. Long, L. Liu, T. Komura, and W. Wang, “Syncdreamer: Generating multiview-consistent images from a single- view image,”arXiv preprint arXiv:2309.03453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Zero-1-to-3: Zero-shot one image to 3d object,

R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” inCVPR, 2023

work page 2023
[12]

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,”arXiv preprint arXiv:2310.15110, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

arXiv preprint arXiv:2310.15008 , year=

X. Long, Y .-C. Guo, C. Lin, Y . Liu, Z. Dou, L. Liu, Y . Ma, S.-H. Zhang, M. Habermann, C. Theobaltet al., “Wonder3d: Single image to 3d using cross-domain diffusion,”arXiv preprint arXiv:2310.15008, 2023

work page arXiv 2023
[14]

Objaverse: A universe of annotated 3d objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 142–13 153

work page 2023
[15]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inCVPR, 2023

work page 2023
[16]

Crm: Single image to 3d textured mesh with convolutional reconstruction model,

Z. Wang, Y . Wang, Y . Chen, C. Xiang, S. Chen, D. Yu, C. Li, H. Su, and J. Zhu, “Crm: Single image to 3d textured mesh with convolutional reconstruction model,”arXiv preprint arXiv:2403.05034, 2024

work page arXiv 2024
[17]

Modular primitives for high-performance differentiable rendering,

S. Laine, J. Hellsten, T. Karras, Y . Seol, J. Lehtinen, and T. Aila, “Modular primitives for high-performance differentiable rendering,” ACM Transactions on Graphics, vol. 39, no. 6, 2020

work page 2020
[18]

Instructpix2pix: Learning to follow image editing instructions,

T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” inCVPR, 2023

work page 2023
[19]

Pixel- wise view selection for unstructured multi-view stereo,

J. L. Sch ¨onberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixel- wise view selection for unstructured multi-view stereo,” inEuropean Conference on Computer Vision (ECCV), 2016

work page 2016
[20]

Kinectfusion: real-time 3d reconstruction and interaction using a moving depth cam- era,

S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davisonet al., “Kinectfusion: real-time 3d reconstruction and interaction using a moving depth cam- era,” inProceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 559–568

work page 2011
[21]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

work page 2020
[22]

Superprimitive: Scene recon- struction at a primitive level,

K. Mazur, G. Bae, and A. J. Davison, “Superprimitive: Scene recon- struction at a primitive level,”arXiv preprint arXiv:2312.05889, 2023

work page arXiv 2023
[23]

Panoptic 3d scene reconstruction from a single rgb image,

M. Dahnert, J. Hou, M. Nießner, and A. Dai, “Panoptic 3d scene reconstruction from a single rgb image,”Advances in Neural Information Processing Systems, vol. 34, pp. 8282–8293, 2021

work page 2021
[24]

Know your neighbors: Improving single-view reconstruction via spatial vision-language reasoning,

R. Li, T. Fischer, M. Segu, M. Pollefeys, L. Van Gool, and F. Tombari, “Know your neighbors: Improving single-view reconstruction via spatial vision-language reasoning,”arXiv preprint arXiv:2404.03658, 2024

work page arXiv 2024
[25]

Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion,

J. Yao and J. Zhang, “Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion,”arXiv preprint arXiv:2311.17084, 2023

work page arXiv 2023
[26]

Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,

J. Yao, C. Li, K. Sun, Y . Cai, H. Li, W. Ouyang, and H. Li, “Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2023, pp. 9421– 9431

work page 2023
[27]

Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields,

A.-Q. Cao and R. de Charette, “Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9387–9398

work page 2023
[28]

Rico: Regularizing the unobservable for indoor compositional reconstruction,

Z. Li, X. Lyu, Y . Ding, M. Wang, Y . Liao, and Y . Liu, “Rico: Regularizing the unobservable for indoor compositional reconstruction,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 761–17 771

work page 2023
[29]

Holistic 3d scene understanding from a single image with implicit representa- tion,

C. Zhang, Z. Cui, Y . Zhang, B. Zeng, M. Pollefeys, and S. Liu, “Holistic 3d scene understanding from a single image with implicit representa- tion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8833–8842

work page 2021
[30]

Towards high-fidelity single-view holistic reconstruction of indoor scenes,

H. Liu, Y . Zheng, G. Chen, S. Cui, and X. Han, “Towards high-fidelity single-view holistic reconstruction of indoor scenes,” inEuropean Con- ference on Computer Vision. Springer, 2022, pp. 429–446

work page 2022
[31]

Single- view 3d scene reconstruction with high-fidelity shape and texture,

Y . Chen, J. Ni, N. Jiang, Y . Zhang, Y . Zhu, and S. Huang, “Single- view 3d scene reconstruction with high-fidelity shape and texture,” in 2024 International Conference on 3D Vision (3DV). IEEE, 2024, pp. 1456–1467

work page 2024
[32]

Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image,

T. Chu, P. Zhang, Q. Liu, and J. Wang, “Buol: A bottom-up framework with occupancy-aware lifting for panoptic 3d scene reconstruction from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4937–4946

work page 2023
[33]

Uni-3d: A universal model for panoptic 3d scene reconstruction,

X. Zhang, Z. Chen, F. Wei, and Z. Tu, “Uni-3d: A universal model for panoptic 3d scene reconstruction,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9256–9266

work page 2023
[34]

Open: Occlusion-invariant perception network for single image-based 3d shape retrieval,

F. Chu, Y . Cong, and R. Chen, “Open: Occlusion-invariant perception network for single image-based 3d shape retrieval,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

work page 2024
[35]

Diffcad: Weakly- supervised probabilistic cad model retrieval and alignment from an rgb image,

D. Gao, D. Rozenberszki, S. Leutenegger, and A. Dai, “Diffcad: Weakly- supervised probabilistic cad model retrieval and alignment from an rgb image,”arXiv preprint arXiv:2311.18610, 2023

work page arXiv 2023
[36]

Generalizable 3d scene recon- struction via divide and conquer from a single view,

A. Dogaru, M. ¨Ozer, and B. Egger, “Generalizable 3d scene recon- struction via divide and conquer from a single view,”arXiv preprint arXiv:2404.03421, 2024

work page arXiv 2024
[37]

Comboverse: Compositional 3d assets creation using spatially-aware diffusion guid- ance,

Y . Chen, T. Wang, T. Wu, X. Pan, K. Jia, and Z. Liu, “Comboverse: Compositional 3d assets creation using spatially-aware diffusion guid- ance,”arXiv preprint arXiv:2403.12409, 2024

work page arXiv 2024
[38]

3d cinemagraphy from a single image,

X. Li, Z. Cao, H. Sun, J. Zhang, K. Xian, and G. Lin, “3d cinemagraphy from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4595–4605

work page 2023
[39]

Sine: Semantic-driven image-based nerf editing with prior- guided editing field,

C. Bao, Y . Zhang, B. Yang, T. Fan, Z. Yang, H. Bao, G. Zhang, and Z. Cui, “Sine: Semantic-driven image-based nerf editing with prior- guided editing field,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 919–20 929

work page 2023
[40]

Lolep: Single-view view syn- thesis with locally-learned planes and self-attention occlusion inference,

C. Wang, Y .-P. Wang, and D. Manocha, “Lolep: Single-view view syn- thesis with locally-learned planes and self-attention occlusion inference,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 841–10 851

work page 2023
[41]

Perf: Panoramic neural radiance field from a single panorama,

G. Wang, P. Wang, Z. Chen, W. Wang, C. C. Loy, and Z. Liu, “Perf: Panoramic neural radiance field from a single panorama,”arXiv preprint arXiv:2310.16831, 2023

work page arXiv 2023
[42]

As-deformable-as- possible single-image-based view synthesis without depth prior,

C. Zhang, C. Lin, K. Liao, L. Nie, and Y . Zhao, “As-deformable-as- possible single-image-based view synthesis without depth prior,”IEEE Transactions on Circuits and Systems for Video Technology, 2023

work page 2023
[43]

Sinmpi: Novel view synthesis from a single image with expanded multiplane images,

G. Pu, P.-S. Wang, and Z. Lian, “Sinmpi: Novel view synthesis from a single image with expanded multiplane images,” inSIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10

work page 2023
[44]

Single-view view synthesis in the wild with learned adaptive multiplane images,

Y . Han, R. Wang, and J. Yang, “Single-view view synthesis in the wild with learned adaptive multiplane images,” inACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–8

work page 2022
[45]

Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,

J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, “Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,”arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023
[46]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srini- vasan, J. T. Barron, and B. Poole, “Cat3d: Create anything in 3d with multi-view diffusion models,”arXiv preprint arXiv:2405.10314, 2024

work page internal anchor Pith review arXiv 2024
[47]

Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,

J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” arXiv preprint arXiv:2404.07199, 2024

work page arXiv 2024
[48]

Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields,

A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, and A. Levinshtein, “Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 669–20 679

work page 2023
[49]

Bolt3d: Generating 3d scenes in seconds,

S. Szymanowicz, J. Y . Zhang, P. Srinivasan, R. Gao, A. Brussee, A. Holynski, R. Martin-Brualla, J. T. Barron, and P. Henzler, “Bolt3d: Generating 3d scenes in seconds,”arXiv preprint arXiv:2503.14445, 2025

work page arXiv 2025
[50]

Nerf: Representing scenes as neural radiance fields for view MANUSCRIPT SUBMITTED TO IEEE TVCG 12 synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view MANUSCRIPT SUBMITTED TO IEEE TVCG 12 synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021

work page 2021
[51]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

work page 2023
[52]

Diffuscene: Denoising diffusion models for gerative indoor scene synthesis,

J. Tang, Y . Nie, L. Markhasin, A. Dai, J. Thies, and M. Nießner, “Diffuscene: Denoising diffusion models for gerative indoor scene synthesis,” inProceedings of the ieee/cvf conference on computer vision and pattern recognition, 2024

work page 2024
[53]

Commonscenes: Generating commonsense 3d indoor scenes with scene graphs,

G. Zhai, E. P. Örnek, S.-C. Wu, Y. Di, F. Tombari, N. Navab, and B. Busam, “Commonscenes: Generating commonsense 3d indoor scenes with scene graphs,” Advances in Neural Information Processing Systems, vol. 36, 2024

[54]

Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior,

C. Lin and Y. Mu, “Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior,” arXiv preprint arXiv:2402.04717, 2024

[55]

Furniscene: A large-scale 3d room dataset with intricate furnishing scenes,

G. Zhang, Y. Wang, C. Luo, S. Xu, J. Peng, Z. Zhang, and M. Zhang, “Furniscene: A large-scale 3d room dataset with intricate furnishing scenes,” arXiv preprint arXiv:2401.03470, 2024

[56]

Echoscene: Indoor scene generation via information echo over scene graph diffusion,

G. Zhai, E. P. Örnek, D. Z. Chen, R. Liao, Y. Di, N. Navab, F. Tombari, and B. Busam, “Echoscene: Indoor scene generation via information echo over scene graph diffusion,” arXiv preprint arXiv:2405.00915, 2024

[57]

Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,

Z. Wu, Y. Li, H. Yan, T. Shang, W. Sun, S. Wang, R. Cui, W. Liu, H. Sato, H. Li et al., “Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,” arXiv preprint arXiv:2401.17053, 2024

[58]

Scenewiz3d: Towards text-guided 3d scene composition,

Q. Zhang, C. Wang, A. Siarohin, P. Zhuang, Y. Xu, C. Yang, D. Lin, B. Zhou, S. Tulyakov, and H.-Y. Lee, “Scenewiz3d: Towards text-guided 3d scene composition,” arXiv preprint arXiv:2312.08885, 2023

[59]

Graphdreamer: Compositional 3d scene synthesis from scene graphs,

G. Gao, W. Liu, A. Chen, A. Geiger, and B. Schölkopf, “Graphdreamer: Compositional 3d scene synthesis from scene graphs,” arXiv preprint arXiv:2312.00093, 2023

[60]

Text2scene: Text-driven indoor scene stylization with part-aware details,

I. Hwang, H. Kim, and Y. M. Kim, “Text2scene: Text-driven indoor scene stylization with part-aware details,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1890–1899

[61]

Text2nerf: Text-driven 3d scene generation with neural radiance fields,

J. Zhang, X. Li, Z. Wan, C. Wang, and J. Liao, “Text2nerf: Text-driven 3d scene generation with neural radiance fields,” IEEE Transactions on Visualization and Computer Graphics, 2024

[62]

Dreamscene360: Unconstrained text-to-3d scene generation with panoramic gaussian splatting,

S. Zhou, Z. Fan, D. Xu, H. Chang, P. Chari, T. Bharadwaj, S. You, Z. Wang, and A. Kadambi, “Dreamscene360: Unconstrained text-to-3d scene generation with panoramic gaussian splatting,” arXiv preprint arXiv:2404.06903, 2024

[63]

Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,

H. Li, H. Shi, W. Zhang, W. Wu, Y. Liao, L. Wang, L.-h. Lee, and P. Zhou, “Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,” arXiv preprint arXiv:2404.03575, 2024

[64]

Text2immersion: Generative immersive scene with 3d gaussians,

H. Ouyang, K. Heal, S. Lombardi, and T. Sun, “Text2immersion: Generative immersive scene with 3d gaussians,” arXiv preprint arXiv:2312.09242, 2023

[65]

Controlroom3d: Room generation using semantic proxy rooms,

J. Schult, S. Tsai, L. Höllein, B. Wu, J. Wang, C.-Y. Ma, K. Li, X. Wang, F. Wimbauer, Z. He et al., “Controlroom3d: Room generation using semantic proxy rooms,” arXiv preprint arXiv:2312.05208, 2023

[66]

Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints,

C. Fang, X. Hu, K. Luo, and P. Tan, “Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints,” arXiv preprint arXiv:2310.03602, 2023

[67]

3d-scenedreamer: Text-driven 3d-consistent scene generation,

F. Zhang, Y. Zhang, Q. Zheng, R. Ma, W. Hua, H. Bao, W. Xu, and C. Zou, “3d-scenedreamer: Text-driven 3d-consistent scene generation,” arXiv preprint arXiv:2403.09439, 2024

[68]

Showroom3d: Text to high-quality 3d room generation using 3d priors,

W. Mao, Y.-P. Cao, J.-W. Liu, Z. Xu, and M. Z. Shou, “Showroom3d: Text to high-quality 3d room generation using 3d priors,” arXiv preprint arXiv:2312.13324, 2023

[69]

Fastscene: Text-driven fast 3d indoor scene generation via panoramic gaussian splatting,

Y. Ma, D. Zhan, and Z. Jin, “Fastscene: Text-driven fast 3d indoor scene generation via panoramic gaussian splatting,” arXiv preprint arXiv:2405.05768, 2024

[70]

360dvd: Controllable panorama video generation with 360-degree video diffusion model,

Q. Wang, W. Li, C. Mou, X. Cheng, and J. Zhang, “360dvd: Controllable panorama video generation with 360-degree video diffusion model,” arXiv preprint arXiv:2401.06578, 2024

[71]

MVDream: Multi-view Diffusion for 3D Generation

Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023

[72]

Imagedream: Image-prompt multi-view diffusion for 3d generation

P. Wang and Y. Shi, “Imagedream: Image-prompt multi-view diffusion for 3d generation,” arXiv preprint arXiv:2312.02201, 2023

[73]

Direct2.5: Diverse text-to-3d generation via multi-view 2.5D diffusion,

Y. Lu, J. Zhang, S. Li, T. Fang, D. McKinnon, Y. Tsin, L. Quan, X. Cao, and Y. Yao, “Direct2.5: Diverse text-to-3d generation via multi-view 2.5D diffusion,” arXiv preprint arXiv:2311.15980, 2023

[74]

Shap-E: Generating Conditional 3D Implicit Functions

H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” arXiv preprint arXiv:2305.02463, 2023

[75]

Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers,

Z.-X. Zou, Z. Yu, Y.-C. Guo, Y. Li, D. Liang, Y.-P. Cao, and S.-H. Zhang, “Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers,” arXiv preprint arXiv:2312.09147, 2023

[76]

Instant-3d: Instant neural radiance field training towards on-device ar/vr 3d reconstruction,

S. Li, C. Li, W. Zhu, B. Yu, Y. Zhao, C. Wan, H. You, H. Shi, and Y. Lin, “Instant-3d: Instant neural radiance field training towards on-device ar/vr 3d reconstruction,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

[77]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” arXiv preprint arXiv:2309.16653, 2023

[78]

Objaverse-xl: A universe of 10m+ 3d objects,

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre et al., “Objaverse-xl: A universe of 10m+ 3d objects,” Advances in Neural Information Processing Systems, vol. 36, 2024

[79]

Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion,

S. Tang, F. Zhang, J. Chen, P. Wang, and Y. Furukawa, “Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion,” arXiv preprint arXiv:2307.01097, 2023

[80]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y. Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” arXiv preprint arXiv:2403.12013, 2024


