HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes
Pith reviewed 2026-06-28 01:42 UTC · model grok-4.3
The pith
A staged pipeline generates controllable whole-home indoor scenes from floorplans using LLM synthesis, image drafting, and VLM refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing the generation into a sequence of controllable stages—LLM-based floorplan creation with K-D tree representation, multi-viewpoint furniture layout via image models, VLM iterative refinement for furniture and objects, and final asset and attribute attachment—the method produces whole-home scenes with improved layout diversity and design quality over previous approaches.
What carries the argument
The unified hierarchical framework that breaks scene synthesis into floorplan, furniture, and object stages with VLM-based iterative correction.
If this is right
- Produces scenes with greater layout diversity and 3D design appeal than prior methods, as shown in experiments and user studies.
- Enables fine-grained control over whole-home floorplans through detailed text descriptions.
- Generates densely interactive scenes with manipulable objects for embodied AI simulation.
- Facilitates community progress by releasing the 300K floorplan dataset and 5K furnished scenes.
Where Pith is reading between the lines
- Such generated scenes could provide scalable training data for vision and language models in home robotics.
- The multi-stage approach with refinement might be adapted to generate scenes in other environments like offices or outdoors.
- Integration with 3D generative models for asset replacement could allow customization without full regeneration.
Load-bearing premise
The VLM-based refiner can iteratively correct furniture and object placements from multi-level viewpoints without introducing new inconsistencies or requiring manual intervention.
What would settle it
Compare the number of physical simulation failures or user-reported inconsistencies in scenes generated with and without the VLM refiner stage.
read the original abstract
Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HomeWorld, a unified hierarchical framework for generating controllable, densely interactive whole-home 3D scenes starting from floorplans. It curates a 300K real residential floorplan dataset to train an LLM for whole-home floorplan generation with K-D tree representation for controllability; leverages image generation models to draft furniture layouts from multi-level roaming viewpoints; uses a VLM-based refiner to iteratively correct furniture and object placements; generates small manipulable objects on supporting surfaces; employs 3D generative models for asset replacement; and attaches physical attributes, textures, and lighting for embodied AI simulation. Experiments and user studies are claimed to show greater layout diversity and 3D design appeal than prior methods on quantitative and qualitative metrics, with plans to release the floorplan dataset and 5K furnished scenes.
Significance. If the pipeline's claims hold, the work would offer a practical advance for indoor scene synthesis in robot simulation and interior design by addressing the gap in global coherence for whole-home scenes, unlike prior methods focused on isolated sub-tasks or hand-crafted rules. The hierarchical decomposition, use of pre-trained models, and planned public release of a large-scale floorplan dataset plus 5K scenes represent concrete strengths that could enable reproducible progress in the field.
major comments (2)
- [furniture and object layout generation stage] Furniture and object layout generation stage (abstract and corresponding section): The VLM-based refiner is described as iteratively correcting furniture and object placements from multi-level viewpoints, yet no ablation studies, success-rate metrics, inconsistency counts, convergence criteria, or validation that it avoids introducing new errors (e.g., floating objects or overlaps) are reported. This is load-bearing for the central claim of outperforming priors on layout diversity and 3D appeal, as unvalidated corrections could undermine the quantitative and user-study results.
- [Experiments and user studies] Experiments and user studies section: The abstract asserts outperformance on quantitative and qualitative metrics without specifying the exact metrics, baselines, statistical tests, or effect sizes in the provided description; this prevents assessment of whether the evidence supports the superiority claims, particularly given reliance on the unvalidated VLM refiner.
minor comments (2)
- Ensure all stages of the pipeline (including K-D tree representation details and physical attribute attachment) are cross-referenced with explicit section numbers for clarity.
- The project page URL is provided; confirm that supplementary materials such as dataset access and scene examples are directly linked in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for more rigorous validation of the VLM refiner and greater specificity in the experimental claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [furniture and object layout generation stage] Furniture and object layout generation stage (abstract and corresponding section): The VLM-based refiner is described as iteratively correcting furniture and object placements from multi-level viewpoints, yet no ablation studies, success-rate metrics, inconsistency counts, convergence criteria, or validation that it avoids introducing new errors (e.g., floating objects or overlaps) are reported. This is load-bearing for the central claim of outperforming priors on layout diversity and 3D appeal, as unvalidated corrections could undermine the quantitative and user-study results.
Authors: We agree that isolating the VLM refiner's contribution is important given its role in the pipeline. The manuscript validates the end-to-end results via quantitative layout diversity metrics and user studies on 3D appeal, but does not include dedicated ablations or error analysis for the refiner itself. In the revised version, we will add a new subsection with ablation results: success rates (pre- vs. post-refinement), counts of corrected inconsistencies (e.g., overlaps, floating objects), convergence behavior after iterations, and checks confirming no net increase in errors. This will directly support the superiority claims. revision: yes
-
Referee: [Experiments and user studies] Experiments and user studies section: The abstract asserts outperformance on quantitative and qualitative metrics without specifying the exact metrics, baselines, statistical tests, or effect sizes in the provided description; this prevents assessment of whether the evidence supports the superiority claims, particularly given reliance on the unvalidated VLM refiner.
Authors: The full experiments section details the metrics (layout diversity via coverage and entropy scores, realism via FID and user preference rates), baselines (prior floorplan-to-scene methods), user study design (pairwise comparisons with 50 participants), and statistical tests (paired t-tests with p-values and Cohen's d effect sizes). The abstract is intentionally concise. We will revise the abstract to explicitly list the primary quantitative metrics, main baselines, and note that statistical significance was evaluated with reported effect sizes. This addresses the clarity concern while preserving the existing results. revision: yes
Circularity Check
No circularity: applied pipeline with external models and no self-referential derivations
full rationale
The paper describes a hierarchical engineering pipeline: curating a 300K floorplan dataset to train an LLM, using image generation models for layouts, a VLM refiner for corrections, and a 3D generative model for assets. No equations, predictions, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claims rest on experimental comparisons and user studies against prior methods, which are independent of any internal redefinition. This matches the default case of a self-contained applied system description.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pre-trained large language models can generate coherent whole-home floorplans from detailed text descriptions when trained on 300K real residential examples
- domain assumption Image generation models can draft furniture layouts from multi-level roaming viewpoints that are accurate enough for subsequent VLM refinement
Reference graph
Works this paper leans on
-
[1]
Procthor: Large-scale embodied ai using procedural generation
Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022
2022
-
[2]
Structured3d: A large photo-realistic dataset for structured 3d modeling
Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InEuropean Conference on Computer Vision, pages 519–535. Springer, 2020
2020
-
[3]
MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments
Manolis Savva, Angel X Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. Minos: Multimodal indoor simulator for navigation in complex environments.arXiv preprint arXiv:1712.03931, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[4]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017
2017
-
[5]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[6]
Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation
Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, et al. Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation. ACM Transactions on Graphics (ToG), 43(4):1–17, 2024
2024
-
[7]
Diffindscene: Diffusion-based high-quality 3d indoor scene generation
Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, and Hongsheng Li. Diffindscene: Diffusion-based high-quality 3d indoor scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4526–4535, 2024
2024
-
[8]
Text2room: Extracting textured 3d meshes from 2d text-to-image models
Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7909–7920, 2023
2023
-
[9]
and National Institute of Informatics (NII)
LIFULL Co., Ltd. and National Institute of Informatics (NII). LIFULL HOME’S High Resolution Floor Plan Image Data. https://www.nii.ac.jp/dsc/idr/en/lifull/1.html, 2017. Accessed: 2026-02-28
2017
-
[10]
Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019
Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019
2019
-
[11]
Msd: A benchmark dataset for floor plan generation of building complexes
Casper Van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, and Seyran Khademi. Msd: A benchmark dataset for floor plan generation of building complexes. InEuropean Conference on Computer Vision, pages 60–75. Springer, 2024
2024
-
[12]
Mohamed Abouagour and Eleftherios Garyfallidis. Resplan: A large-scale vector-graph dataset of 17,000 residential floor plans.arXiv preprint arXiv:2508.14006, 2025
-
[13]
3d-front: 3d furnished rooms with layouts and semantics
Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021
2021
-
[14]
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulatable indoor scene dataset with realistic layouts. arXiv preprint arXiv:2509.10813, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Sceneverse: Scaling 3d vision-language learning for grounded scene understanding
Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision, pages 289–310. Springer, 2024. 19
2024
-
[16]
Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation
Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. InProceedings of the 63rd AnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025
2025
-
[17]
Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model
Minyang Xu, Yunzhong Lou, Xiang Gao, and Xiangdong Zhou. Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model. InProceedings of the 2025 International Conference on Multimedia Retrieval, pages 1617–1625, 2025
2025
-
[18]
Layoutgpt: Compositional visual planning and generation with large language models
Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023
2023
-
[19]
Holodeck: Language guided generation of 3d embodied ai envi- ronments
Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024
2024
-
[20]
Layoutvlm: Differentiable optimization of 3d layout via vision-language models
Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025
2025
-
[21]
Infinigen indoors: Photorealistic indoor scenes using procedural generation
Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024
2024
-
[22]
Physcene: Physically interactable 3d scene synthesis for embodied ai
Yandan Yang, Baoxiong Jia, Peiyuan Zhi, and Siyuan Huang. Physcene: Physically interactable 3d scene synthesis for embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16262–16272, 2024
2024
-
[23]
Chong Su, Yingbin Fu, Zheyuan Hu, Jing Yang, Param Hanji, Shaojun Wang, Xuan Zhao, Cengiz Öztireli, and Fangcheng Zhong. Chord: Generation of collision-free, house-scale, and organized digital twins for 3d indoor scenes with controllable floor plans and optimal layouts.arXiv preprint arXiv:2503.11958, 2025
-
[24]
EmbodiedGen: Towards a generative 3D world engine for embodied intelligence
Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. Embod- iedgen: Towards a generative 3d world engine for embodied intelligence.arXiv preprintarXiv:2506.10600, 2025
-
[25]
Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation
Aleksey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 628–639, 2025
2025
-
[26]
Midi: Multi-instance diffusion for single image to 3d scene generation
Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025
2025
-
[27]
I-design: Personalized llm interior designer
Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. InEuropean Conference on Computer Vision, pages 217–234. Springer, 2024
2024
-
[28]
Xinhang Liu, Chi-Keung Tang, and Yu-Wing Tai. Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025
-
[29]
Diffuscene: Denoising diffusion models for generative indoor scene synthesis
Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. Diffuscene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20507–20518, 2024. 20
2024
-
[30]
ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
Martin JJ Bucher and Iro Armeni. Respace: Text-driven 3d indoor scene synthesis and editing with preference alignment.arXiv preprint arXiv:2506.02459, 2025
work page internal anchor Pith review arXiv 2025
-
[31]
arXiv preprint arXiv:2406.03866 , year=
Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model.arXiv preprint arXiv:2406.03866, 2024
-
[32]
arXiv preprint arXiv:2311.13384 , year=
Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023
-
[33]
Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models
Yiwen Chen, Hieu T Nguyen, Vikram Voleti, Varun Jampani, and Huaizu Jiang. Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28440–28450, 2025
2025
-
[34]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
SAM 3D: 3Dfy Anything in Images
Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025
-
[37]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 21
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.