pith. sign in

arxiv: 2606.27923 · v2 · pith:DF4UUR3Xnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI

Home3D 1.0: A High-Fidelity Image-to-3D Asset Generation System for Interior Design

Pith reviewed 2026-06-30 09:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image-to-3D generationsingle-image 3D reconstructionPBR material assignmentmesh decompositioninterior design assetsSDF-based geometrymodular 3D pipelinetexture completion
0
0 comments X

The pith

Home3D turns one photo of furniture into a watertight 3D mesh with PBR materials that decomposes into editable parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Home3D 1.0, a modular image-to-3D system that generates high-quality assets from a single reference photograph for interior design and e-commerce. It organizes the pipeline into four tightly coupled modules that reconstruct geometry, predict and complete textures, assign materials from a library, and produce semantic part decompositions. A sympathetic reader would care because the output is claimed to be a production-ready mesh with physically-based rendering properties that supports material-specific editing without requiring multiple input views. Each module is evaluated separately to show current capabilities and remaining gaps.

Core claim

Given a photograph of a furniture or decor item, the system outputs a mesh with physically-based rendering (PBR) materials, and the mesh can be decomposed into material-specific components through four tightly coupled modules: Geometry reconstructs a watertight mesh through latent SDF modelling with a geometry VAE and a coarse-to-fine flow-matching DiT; Texture predicts multiview albedo observations, reprojects them onto the mesh, and completes unseen surface regions with a 3D texture field; Material uses MatWeaver to obtain component masks through video-based segmentation and UV-space voting, then retrieves and bakes PBR maps from a curated material library through hierarchical multi-modal

What carries the argument

Four tightly coupled modules (Geometry via latent SDF VAE and flow-matching DiT, Texture via multiview albedo reprojection plus 3D field completion, Material via MatWeaver segmentation and library matching, Parts via PartVAE and PartDiT) that together produce a decomposable PBR mesh from one image.

If this is right

  • The output is a watertight mesh suitable for rendering and editing in standard 3D software.
  • Materials are retrieved from a curated library and baked as PBR maps that support component-level changes.
  • The mesh decomposes into material-specific and semantic part meshes generated in a single pass.
  • Each of the four modules can be assessed independently using dedicated metrics.
  • The system targets direct use in interior design and e-commerce workflows from ordinary photographs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the coupling works as described, the approach could reduce reliance on multi-view capture rigs for asset creation.
  • The modular design suggests it might be possible to swap individual modules for domain-specific improvements, such as better handling of reflective surfaces.
  • Automated part decomposition could support downstream tasks like physics simulation or modular furniture reconfiguration in virtual environments.
  • Extending the input to include partial depth or lighting estimates might close some of the remaining gaps noted in the module evaluations.

Load-bearing premise

The four modules can be tightly coupled in practice to produce high-fidelity, usable assets without major quality gaps or manual intervention.

What would settle it

Testing the full pipeline on a held-out set of real furniture photographs and finding that the generated meshes contain non-watertight surfaces, mismatched PBR parameters, or part decompositions that do not align with the input image's visible materials.

Figures

Figures reproduced from arXiv: 2606.27923 by Chuqiao Wu, Delong Wu, Feng Zhang, Guoqiu Li, Haohui Chen, Hong Wu, Jianan Jiang, Jing Li, Jin Song, Qi Wu, Ruigao Yang, YinDong Kong, Yiyun Fei, Ziru Zeng.

Figure 1
Figure 1. Figure 1: A living room scene populated with furniture models generated by Home3D 1.0. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Home3D 1.0 System The pipeline proceeds as follows: 1. Geometry (§2.2) reconstructs a watertight mesh M from the input image using a coarse-to-fine latent SDF generation process. 2. Texture (§2.3) predicts multiview albedo observations, reprojects them onto M, and completes unseen regions in 3D surface space. 3. Material (§2.4) segments material regions on the mesh and retrieves high-quality PBR maps from … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the geometry generation framework, including image-conditioned coarse-to-fine DiT [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the proposed texture generation framework. The system predicts multiview albedo [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of our proposed framework. The pipeline consists of two main stages: (a) [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Architecture of the material-aware part generation stage. A PartVAE defines a compact latent [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparisons between Home3D 1.0 and baselines in terms of 3D shape generation. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparisons between Home3D 1.0 and baselines in terms of PBR appearance genera [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparisons of material-aware part decomposition against closed-source image-to-3D [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Material replacement using Home3D 1.0 part decomposition. After material-aware part genera [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
read the original abstract

We present Home3D 1.0, a modular image-to-3D generation system that produces high-quality 3D assets from a single reference image, targeting interior design and e-commerce applications. Given a photograph of a furniture or decor item, the system outputs a mesh with physically-based rendering (PBR) materials, and the mesh can be decomposed into material-specific components. The pipeline is organized into four tightly coupled modules: Geometry reconstructs a watertight mesh through latent SDF modelling with a geometry VAE and a coarse-to-fine flow-matching DiT; Texture predicts multiview albedo observations, reprojects them onto the mesh, and completes unseen surface regions with a 3D texture field; Material uses MatWeaver to obtain component masks through video-based segmentation and UV-space voting, then retrieves and bakes PBR maps from a curated material library through hierarchical multi-modal matching; and Parts generates material-editable semantic part meshes with a PartVAE and PartDiT, decoding multi-head part-specific SDF fields in one pass. Each module is evaluated independently with dedicated metrics, highlighting both the current system capability and the remaining gaps toward broader deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents Home3D 1.0, a modular image-to-3D generation system for producing high-quality 3D assets with PBR materials and decomposable parts from a single reference image. The pipeline comprises four modules: Geometry (latent SDF modelling with VAE and coarse-to-fine flow-matching DiT for watertight meshes), Texture (multiview albedo prediction, reprojection, and 3D texture field completion), Material (MatWeaver for component masks via video segmentation and UV voting, followed by PBR map retrieval and baking), and Parts (PartVAE and PartDiT for generating material-editable semantic part meshes via multi-head SDF fields). Each module is evaluated independently using dedicated metrics, with the goal of enabling applications in interior design and e-commerce.

Significance. If the integration of the modules succeeds as claimed, the system could offer a substantial advance in automated creation of editable, high-fidelity 3D assets suitable for professional use, addressing a key need in e-commerce and design workflows where manual modeling is costly. The modular structure facilitates independent development and evaluation, which is a positive design choice. However, the absence of system-level validation in the provided description makes it difficult to gauge the practical impact.

major comments (1)
  1. [Abstract] The central claim that the modules 'can be tightly coupled in practice to produce high-fidelity, usable assets without major quality gaps or manual intervention' is not supported by any described end-to-end evaluation. The abstract explicitly notes that 'each module is evaluated independently with dedicated metrics,' but provides no pipeline-level metrics, error-propagation studies, or checks on interface consistency (e.g., how Geometry outputs feed into Texture reprojection or how Material masks condition Parts generation). This leaves the 'tightly coupled' aspect unverified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the modular design. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the modules 'can be tightly coupled in practice to produce high-fidelity, usable assets without major quality gaps or manual intervention' is not supported by any described end-to-end evaluation. The abstract explicitly notes that 'each module is evaluated independently with dedicated metrics,' but provides no pipeline-level metrics, error-propagation studies, or checks on interface consistency (e.g., how Geometry outputs feed into Texture reprojection or how Material masks condition Parts generation). This leaves the 'tightly coupled' aspect unverified.

    Authors: We agree with the observation. The current manuscript evaluates each module independently to isolate technical contributions and remaining gaps, as stated in the abstract. The 'tightly coupled' phrasing in the abstract and introduction refers to the explicit data interfaces defined in the methods (Geometry mesh as input to Texture reprojection, Material component masks conditioning Parts generation, etc.), but no quantitative end-to-end metrics, error-propagation analysis, or interface-consistency experiments are reported. We will revise the abstract to remove or qualify the claim of 'without major quality gaps or manual intervention,' add a dedicated subsection presenting qualitative full-pipeline results on held-out images, and discuss observed interface behavior. These changes will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; modular system description with independent module evaluations.

full rationale

The paper describes a four-module pipeline (Geometry, Texture, Material, Parts) for image-to-3D asset generation and states that each module is evaluated independently with dedicated metrics. No equations, latent variable derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claims concern architectural organization and per-module performance rather than any load-bearing mathematical step that reduces to its own inputs by construction. The integration assumption noted in the skeptic analysis is an empirical claim about system behavior, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The system relies on standard machine-learning components (VAE, DiT, flow-matching) whose assumptions are not detailed here.

pith-pipeline@v0.9.1-grok · 5789 in / 1166 out tokens · 35929 ms · 2026-06-30T09:47:41.302180+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Dora: Sampling and benchmarking for 3d shape variational auto-encoders

    Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan. Dora: Sampling and benchmarking for 3d shape variational auto-encoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16251–16261, 2025

  2. [2]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

  3. [3]

    Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets.arXiv preprint arXiv:2510.19944, 2025

    Jiashi Feng, Xiu Li, Jing Lin, Jiahang Liu, Gaohong Liu, Weiqiang Lou, Su Ma, et al. Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets.arXiv preprint arXiv:2510.19944, 2025

  4. [4]

    3D-FRONT: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10933–10942, October 2021

  5. [5]

    Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

    Diandian Gu, Jing Lin, Gaohong Liu, Jiahang Liu, Su Ma, Guang Shi, Jun Wang, Qinlong Wang, Qianyi Wu, Zhongcong Xu, Xuanyu Yi, Zihao Yu, Jianfeng Zhang, Zhuolin Zheng, Yifan Zhu, Rui Chen, Hengkai Guo, Xiaoyang Guo, Mingcong Han, Xu Han, Xiu Li, Yixun Liang, Weiqiang Lou, Junzhe Lu, Guan Luo, Minghan Qin, Shuguang Wang, and Yuang Wang. Seed3d 2.0: Advanci...

  6. [6]

    Lattice: Democratize high-fidelity 3d generation at scale, 2025

    Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, and Xiangyu Yue. Lattice: Democratize high-fidelity 3d generation at scale, 2025

  7. [7]

    Hunyuan3d studio: End-to-end ai pipeline for game-ready 3d asset generation

    Biwen Lei, Yang Li, Xinhai Liu, Shuhui Yang, Lixin Xu, Jingwei Huang, Ruining Tang, Haohan Weng, Jian Liu, Jing Xu, et al. Hunyuan3d studio: End-to-end ai pipeline for game-ready 3d asset generation. arXiv preprint arXiv:2509.12815, 2025

  8. [8]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.arXiv preprint arXiv:2506.05573, 2025

    Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragki- adaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.arXiv preprint arXiv:2506.05573, 2025

  9. [9]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

  10. [10]

    Lorensen and Harvey E

    William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. InProceedings of SIGGRAPH, volume 21, pages 163–169, 1987. 16 Alibaba Group/Taobao Home3D Technical Report

  11. [11]

    Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

    Maxime Oquab, Timoth´ ee Darcet, Th´ eo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research, 2024

  12. [12]

    Efficient part-level 3d object generation via dual volume packing

    Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, and Tsung-Yi Lin. Efficient part-level 3d object generation via dual volume packing. arXiv preprint arXiv:2506.09980, 2025

  13. [13]

    Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

    Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material.arXiv preprint arXiv:2506.15442, 2025

  14. [14]

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, , Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image.arXiv preprint arXiv:2403.02151, 2024

  15. [15]

    Native and compact structured latents for 3d generation.Tech report, 2025

    Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation.Tech report, 2025

  16. [16]

    X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

    Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, and Chunchao Guo. X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

  17. [17]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018. 17 Alibaba Group/Taobao Home3D Technical Report A Contributions and Acknowledgments All contr...