HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Hongsheng Li; Rongyao Fang; Wenbo Li; Xiaoliang Ju; Zipeng Qin

arxiv: 2606.06390 · v1 · pith:GTPDWMDLnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Wenbo Li , Xiaoliang Ju , Zipeng Qin , Rongyao Fang , Hongsheng Li This is my paper

Pith reviewed 2026-06-28 01:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords indoor scene generationfloorplan to sceneembodied AIwhole home scenesfurniture layoutVLM refinement3D scene synthesis

0 comments

The pith

A staged pipeline generates controllable whole-home indoor scenes from floorplans using LLM synthesis, image drafting, and VLM refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop a complete system to synthesize entire furnished homes suitable for robot simulation and design. They start by training an LLM on 300,000 real floorplans to produce varied whole-home layouts with control via descriptions and a tree structure. Then image models create furniture arrangements viewed from multiple levels, followed by object placement on surfaces with an iterative visual language model corrector. This matters because prior work handled only parts of the problem or used rigid rules, leading to less realistic and less diverse results. The output scenes include physical properties for interactive use.

Core claim

By decomposing the generation into a sequence of controllable stages—LLM-based floorplan creation with K-D tree representation, multi-viewpoint furniture layout via image models, VLM iterative refinement for furniture and objects, and final asset and attribute attachment—the method produces whole-home scenes with improved layout diversity and design quality over previous approaches.

What carries the argument

The unified hierarchical framework that breaks scene synthesis into floorplan, furniture, and object stages with VLM-based iterative correction.

If this is right

Produces scenes with greater layout diversity and 3D design appeal than prior methods, as shown in experiments and user studies.
Enables fine-grained control over whole-home floorplans through detailed text descriptions.
Generates densely interactive scenes with manipulable objects for embodied AI simulation.
Facilitates community progress by releasing the 300K floorplan dataset and 5K furnished scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such generated scenes could provide scalable training data for vision and language models in home robotics.
The multi-stage approach with refinement might be adapted to generate scenes in other environments like offices or outdoors.
Integration with 3D generative models for asset replacement could allow customization without full regeneration.

Load-bearing premise

The VLM-based refiner can iteratively correct furniture and object placements from multi-level viewpoints without introducing new inconsistencies or requiring manual intervention.

What would settle it

Compare the number of physical simulation failures or user-reported inconsistencies in scenes generated with and without the VLM refiner stage.

read the original abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a staged pipeline for whole-home scene generation using LLMs, image models, and a VLM refiner, with a planned release of 300K floorplans and 5K scenes that could help embodied AI work.

read the letter

The core contribution here is a hierarchical pipeline that generates controllable whole-home floorplans via an LLM trained on a new 300K residential dataset, then drafts furniture layouts with image models, places small objects, and applies a VLM refiner plus 3D asset replacement before adding physics and textures. The plan to release the floorplan data and 5K furnished scenes stands out as the most concrete value, since prior work often stayed at single rooms or lacked scale.

This setup does address the coherence problem across multiple rooms better than isolated sub-task methods. The K-D tree representation for controllability and the multi-viewpoint drafting step are reasonable engineering choices for making the output simulation-ready.

The soft spot is the VLM refiner. The abstract describes it iteratively correcting placements from multiple viewpoints, but supplies no ablation, success rate, or count of introduced errors like overlaps or floating objects. Without that, the claimed gains in layout diversity and 3D appeal rest on an unexamined component, and the quantitative and user-study comparisons become harder to interpret.

The work is aimed at embodied AI and interior design researchers who need large numbers of furnished environments. The dataset release alone makes it worth a referee's time even if the method sections need tightening on validation. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents HomeWorld, a unified hierarchical framework for generating controllable, densely interactive whole-home 3D scenes starting from floorplans. It curates a 300K real residential floorplan dataset to train an LLM for whole-home floorplan generation with K-D tree representation for controllability; leverages image generation models to draft furniture layouts from multi-level roaming viewpoints; uses a VLM-based refiner to iteratively correct furniture and object placements; generates small manipulable objects on supporting surfaces; employs 3D generative models for asset replacement; and attaches physical attributes, textures, and lighting for embodied AI simulation. Experiments and user studies are claimed to show greater layout diversity and 3D design appeal than prior methods on quantitative and qualitative metrics, with plans to release the floorplan dataset and 5K furnished scenes.

Significance. If the pipeline's claims hold, the work would offer a practical advance for indoor scene synthesis in robot simulation and interior design by addressing the gap in global coherence for whole-home scenes, unlike prior methods focused on isolated sub-tasks or hand-crafted rules. The hierarchical decomposition, use of pre-trained models, and planned public release of a large-scale floorplan dataset plus 5K scenes represent concrete strengths that could enable reproducible progress in the field.

major comments (2)

[furniture and object layout generation stage] Furniture and object layout generation stage (abstract and corresponding section): The VLM-based refiner is described as iteratively correcting furniture and object placements from multi-level viewpoints, yet no ablation studies, success-rate metrics, inconsistency counts, convergence criteria, or validation that it avoids introducing new errors (e.g., floating objects or overlaps) are reported. This is load-bearing for the central claim of outperforming priors on layout diversity and 3D appeal, as unvalidated corrections could undermine the quantitative and user-study results.
[Experiments and user studies] Experiments and user studies section: The abstract asserts outperformance on quantitative and qualitative metrics without specifying the exact metrics, baselines, statistical tests, or effect sizes in the provided description; this prevents assessment of whether the evidence supports the superiority claims, particularly given reliance on the unvalidated VLM refiner.

minor comments (2)

Ensure all stages of the pipeline (including K-D tree representation details and physical attribute attachment) are cross-referenced with explicit section numbers for clarity.
The project page URL is provided; confirm that supplementary materials such as dataset access and scene examples are directly linked in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for more rigorous validation of the VLM refiner and greater specificity in the experimental claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [furniture and object layout generation stage] Furniture and object layout generation stage (abstract and corresponding section): The VLM-based refiner is described as iteratively correcting furniture and object placements from multi-level viewpoints, yet no ablation studies, success-rate metrics, inconsistency counts, convergence criteria, or validation that it avoids introducing new errors (e.g., floating objects or overlaps) are reported. This is load-bearing for the central claim of outperforming priors on layout diversity and 3D appeal, as unvalidated corrections could undermine the quantitative and user-study results.

Authors: We agree that isolating the VLM refiner's contribution is important given its role in the pipeline. The manuscript validates the end-to-end results via quantitative layout diversity metrics and user studies on 3D appeal, but does not include dedicated ablations or error analysis for the refiner itself. In the revised version, we will add a new subsection with ablation results: success rates (pre- vs. post-refinement), counts of corrected inconsistencies (e.g., overlaps, floating objects), convergence behavior after iterations, and checks confirming no net increase in errors. This will directly support the superiority claims. revision: yes
Referee: [Experiments and user studies] Experiments and user studies section: The abstract asserts outperformance on quantitative and qualitative metrics without specifying the exact metrics, baselines, statistical tests, or effect sizes in the provided description; this prevents assessment of whether the evidence supports the superiority claims, particularly given reliance on the unvalidated VLM refiner.

Authors: The full experiments section details the metrics (layout diversity via coverage and entropy scores, realism via FID and user preference rates), baselines (prior floorplan-to-scene methods), user study design (pairwise comparisons with 50 participants), and statistical tests (paired t-tests with p-values and Cohen's d effect sizes). The abstract is intentionally concise. We will revise the abstract to explicitly list the primary quantitative metrics, main baselines, and note that statistical significance was evaluated with reported effect sizes. This addresses the clarity concern while preserving the existing results. revision: yes

Circularity Check

0 steps flagged

No circularity: applied pipeline with external models and no self-referential derivations

full rationale

The paper describes a hierarchical engineering pipeline: curating a 300K floorplan dataset to train an LLM, using image generation models for layouts, a VLM refiner for corrections, and a 3D generative model for assets. No equations, predictions, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claims rest on experimental comparisons and user studies against prior methods, which are independent of any internal redefinition. This matches the default case of a self-contained applied system description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the unverified capabilities of pre-trained generative models and the assumption that iterative VLM correction produces usable simulation scenes; no new entities are postulated.

axioms (2)

domain assumption Pre-trained large language models can generate coherent whole-home floorplans from detailed text descriptions when trained on 300K real residential examples
Invoked in the first stage of the pipeline for controllable floorplan generation
domain assumption Image generation models can draft furniture layouts from multi-level roaming viewpoints that are accurate enough for subsequent VLM refinement
Central to the furniture layout drafting step

pith-pipeline@v0.9.1-grok · 5852 in / 1450 out tokens · 36546 ms · 2026-06-28T01:42:21.236209+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 14 canonical work pages · 7 internal anchors

[1]

Procthor: Large-scale embodied ai using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022

2022
[2]

Structured3d: A large photo-realistic dataset for structured 3d modeling

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InEuropean Conference on Computer Vision, pages 519–535. Springer, 2020

2020
[3]

MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

Manolis Savva, Angel X Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. Minos: Multimodal indoor simulator for navigation in complex environments.arXiv preprint arXiv:1712.03931, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017
[5]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation

Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, et al. Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation. ACM Transactions on Graphics (ToG), 43(4):1–17, 2024

2024
[7]

Diffindscene: Diffusion-based high-quality 3d indoor scene generation

Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, and Hongsheng Li. Diffindscene: Diffusion-based high-quality 3d indoor scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4526–4535, 2024

2024
[8]

Text2room: Extracting textured 3d meshes from 2d text-to-image models

Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7909–7920, 2023

2023
[9]

and National Institute of Informatics (NII)

LIFULL Co., Ltd. and National Institute of Informatics (NII). LIFULL HOME’S High Resolution Floor Plan Image Data. https://www.nii.ac.jp/dsc/idr/en/lifull/1.html, 2017. Accessed: 2026-02-28

2017
[10]

Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019

Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019

2019
[11]

Msd: A benchmark dataset for floor plan generation of building complexes

Casper Van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, and Seyran Khademi. Msd: A benchmark dataset for floor plan generation of building complexes. InEuropean Conference on Computer Vision, pages 60–75. Springer, 2024

2024
[12]

Resplan: A large-scale vector-graph dataset of 17,000 residential floor plans.arXiv preprint arXiv:2508.14006, 2025

Mohamed Abouagour and Eleftherios Garyfallidis. Resplan: A large-scale vector-graph dataset of 17,000 residential floor plans.arXiv preprint arXiv:2508.14006, 2025

work page arXiv 2025
[13]

3d-front: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021

2021
[14]

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulatable indoor scene dataset with realistic layouts. arXiv preprint arXiv:2509.10813, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Sceneverse: Scaling 3d vision-language learning for grounded scene understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision, pages 289–310. Springer, 2024. 19

2024
[16]

Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation

Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. InProceedings of the 63rd AnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025

2025
[17]

Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model

Minyang Xu, Yunzhong Lou, Xiang Gao, and Xiangdong Zhou. Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model. InProceedings of the 2025 International Conference on Multimedia Retrieval, pages 1617–1625, 2025

2025
[18]

Layoutgpt: Compositional visual planning and generation with large language models

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023

2023
[19]

Holodeck: Language guided generation of 3d embodied ai envi- ronments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024
[20]

Layoutvlm: Differentiable optimization of 3d layout via vision-language models

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025

2025
[21]

Infinigen indoors: Photorealistic indoor scenes using procedural generation

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024

2024
[22]

Physcene: Physically interactable 3d scene synthesis for embodied ai

Yandan Yang, Baoxiong Jia, Peiyuan Zhi, and Siyuan Huang. Physcene: Physically interactable 3d scene synthesis for embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16262–16272, 2024

2024
[23]

Chong Su, Yingbin Fu, Zheyuan Hu, Jing Yang, Param Hanji, Shaojun Wang, Xuan Zhao, Cengiz Öztireli, and Fangcheng Zhong. Chord: Generation of collision-free, house-scale, and organized digital twins for 3d indoor scenes with controllable floor plans and optimal layouts.arXiv preprint arXiv:2503.11958, 2025

work page arXiv 2025
[24]

EmbodiedGen: Towards a generative 3D world engine for embodied intelligence

Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. Embod- iedgen: Towards a generative 3d world engine for embodied intelligence.arXiv preprintarXiv:2506.10600, 2025

work page arXiv 2025
[25]

Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation

Aleksey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 628–639, 2025

2025
[26]

Midi: Multi-instance diffusion for single image to 3d scene generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

2025
[27]

I-design: Personalized llm interior designer

Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. InEuropean Conference on Computer Vision, pages 217–234. Springer, 2024

2024
[28]

Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

Xinhang Liu, Chi-Keung Tang, and Yu-Wing Tai. Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

work page arXiv 2025
[29]

Diffuscene: Denoising diffusion models for generative indoor scene synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. Diffuscene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20507–20518, 2024. 20

2024
[30]

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Martin JJ Bucher and Iro Armeni. Respace: Text-driven 3d indoor scene synthesis and editing with preference alignment.arXiv preprint arXiv:2506.02459, 2025

work page internal anchor Pith review arXiv 2025
[31]

arXiv preprint arXiv:2406.03866 , year=

Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model.arXiv preprint arXiv:2406.03866, 2024

work page arXiv 2024
[32]

arXiv preprint arXiv:2311.13384 , year=

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023
[33]

Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models

Yiwen Chen, Hieu T Nguyen, Vikram Voleti, Varun Jampani, and Huaizu Jiang. Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28440–28450, 2025

2025
[34]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

SAM 3D: 3Dfy Anything in Images

Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

work page arXiv 2025
[37]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 21

2022

[1] [1]

Procthor: Large-scale embodied ai using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022

2022

[2] [2]

Structured3d: A large photo-realistic dataset for structured 3d modeling

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InEuropean Conference on Computer Vision, pages 519–535. Springer, 2020

2020

[3] [3]

MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

Manolis Savva, Angel X Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. Minos: Multimodal indoor simulator for navigation in complex environments.arXiv preprint arXiv:1712.03931, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017

[5] [5]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation

Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, et al. Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation. ACM Transactions on Graphics (ToG), 43(4):1–17, 2024

2024

[7] [7]

Diffindscene: Diffusion-based high-quality 3d indoor scene generation

Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, and Hongsheng Li. Diffindscene: Diffusion-based high-quality 3d indoor scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4526–4535, 2024

2024

[8] [8]

Text2room: Extracting textured 3d meshes from 2d text-to-image models

Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7909–7920, 2023

2023

[9] [9]

and National Institute of Informatics (NII)

LIFULL Co., Ltd. and National Institute of Informatics (NII). LIFULL HOME’S High Resolution Floor Plan Image Data. https://www.nii.ac.jp/dsc/idr/en/lifull/1.html, 2017. Accessed: 2026-02-28

2017

[10] [10]

Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019

Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings.ACM Transactions on Graphics (SIGGRAPH Asia), 38(6), 2019

2019

[11] [11]

Msd: A benchmark dataset for floor plan generation of building complexes

Casper Van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, and Seyran Khademi. Msd: A benchmark dataset for floor plan generation of building complexes. InEuropean Conference on Computer Vision, pages 60–75. Springer, 2024

2024

[12] [12]

Resplan: A large-scale vector-graph dataset of 17,000 residential floor plans.arXiv preprint arXiv:2508.14006, 2025

Mohamed Abouagour and Eleftherios Garyfallidis. Resplan: A large-scale vector-graph dataset of 17,000 residential floor plans.arXiv preprint arXiv:2508.14006, 2025

work page arXiv 2025

[13] [13]

3d-front: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021

2021

[14] [14]

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, et al. Internscenes: A large-scale simulatable indoor scene dataset with realistic layouts. arXiv preprint arXiv:2509.10813, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Sceneverse: Scaling 3d vision-language learning for grounded scene understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision, pages 289–310. Springer, 2024. 19

2024

[16] [16]

Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation

Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. InProceedings of the 63rd AnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025

2025

[17] [17]

Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model

Minyang Xu, Yunzhong Lou, Xiang Gao, and Xiangdong Zhou. Floorplan-diffusion: Automatic floor plan generation via pre-trained large latent diffusion model. InProceedings of the 2025 International Conference on Multimedia Retrieval, pages 1617–1625, 2025

2025

[18] [18]

Layoutgpt: Compositional visual planning and generation with large language models

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023

2023

[19] [19]

Holodeck: Language guided generation of 3d embodied ai envi- ronments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai envi- ronments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024

[20] [20]

Layoutvlm: Differentiable optimization of 3d layout via vision-language models

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025

2025

[21] [21]

Infinigen indoors: Photorealistic indoor scenes using procedural generation

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024

2024

[22] [22]

Physcene: Physically interactable 3d scene synthesis for embodied ai

Yandan Yang, Baoxiong Jia, Peiyuan Zhi, and Siyuan Huang. Physcene: Physically interactable 3d scene synthesis for embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16262–16272, 2024

2024

[23] [23]

Chong Su, Yingbin Fu, Zheyuan Hu, Jing Yang, Param Hanji, Shaojun Wang, Xuan Zhao, Cengiz Öztireli, and Fangcheng Zhong. Chord: Generation of collision-free, house-scale, and organized digital twins for 3d indoor scenes with controllable floor plans and optimal layouts.arXiv preprint arXiv:2503.11958, 2025

work page arXiv 2025

[24] [24]

EmbodiedGen: Towards a generative 3D world engine for embodied intelligence

Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, and Zhizhong Su. Embod- iedgen: Towards a generative 3d world engine for embodied intelligence.arXiv preprintarXiv:2506.10600, 2025

work page arXiv 2025

[25] [25]

Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation

Aleksey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 628–639, 2025

2025

[26] [26]

Midi: Multi-instance diffusion for single image to 3d scene generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23646–23657, 2025

2025

[27] [27]

I-design: Personalized llm interior designer

Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. InEuropean Conference on Computer Vision, pages 217–234. Springer, 2024

2024

[28] [28]

Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

Xinhang Liu, Chi-Keung Tang, and Yu-Wing Tai. Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

work page arXiv 2025

[29] [29]

Diffuscene: Denoising diffusion models for generative indoor scene synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. Diffuscene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20507–20518, 2024. 20

2024

[30] [30]

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Martin JJ Bucher and Iro Armeni. Respace: Text-driven 3d indoor scene synthesis and editing with preference alignment.arXiv preprint arXiv:2506.02459, 2025

work page internal anchor Pith review arXiv 2025

[31] [31]

arXiv preprint arXiv:2406.03866 , year=

Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model.arXiv preprint arXiv:2406.03866, 2024

work page arXiv 2024

[32] [32]

arXiv preprint arXiv:2311.13384 , year=

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023

[33] [33]

Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models

Yiwen Chen, Hieu T Nguyen, Vikram Voleti, Varun Jampani, and Huaizu Jiang. Housecrafter: Lifting floorplans to 3d scenes with 2d diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28440–28450, 2025

2025

[34] [34]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

SAM 3D: 3Dfy Anything in Images

Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image.arXiv preprint arXiv:2511.13648, 2025

work page arXiv 2025

[37] [37]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 21

2022