RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Ruihui Li; Shougao Zhang; Tianhao Zhou; Wentang Chen; Yiman Zhang

arxiv: 2512.11234 · v2 · pith:DHDXHFTSnew · submitted 2025-12-12 · 💻 cs.CV

Pith reviewed 2026-05-21 16:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords indoor scene synthesis · multimodal parsing · domain-specific language · hierarchical generation · controllable 3D scenes · semantic representation · CAD floor plans

The pith

RoomPilot maps text descriptions and CAD floor plans into an Indoor Domain-Specific Language that drives hierarchical synthesis of controllable multi-room 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoomPilot to generate indoor scenes from mixed inputs such as text and floor plans while giving users precise control over the results. It first converts those varied inputs into a single structured representation called IDSL that captures semantic details clearly. A staged process then assembles the scene from the full building layout down through individual rooms to specific objects, aiming to keep everything logically connected and functionally sensible. This matters for applications like game design and architectural previews because earlier approaches either accept only narrow input forms or produce outputs that are hard to adjust precisely. The work also supplies a labeled collection of 3D assets to help the generated scenes look more realistic and consistent.

Core claim

RoomPilot shows that heterogeneous multimodal inputs can be translated into an Indoor Domain-Specific Language (IDSL) that acts as an interpretable semantic representation, supporting a hierarchical synthesis pipeline that builds scenes progressively at the building, room, and object levels to promote structural coherence and functional consistency across multi-room layouts.

What carries the argument

The Indoor Domain-Specific Language (IDSL) that unifies inputs into a structured semantic representation and enables progressive hierarchical organization at building, room, and object scales.
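As a rough illustration of what a three-level representation of this kind could look like, here is a minimal Python sketch. It is not the paper's IDSL syntax, which is not reproduced on this page; the class names (`Building`, `Room`, `SceneObject`), their fields, and the helper functions are all hypothetical and chosen only to show how building, room, and object levels can nest.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    # Object level: one placed asset (field names are illustrative assumptions).
    category: str
    position: Tuple[float, float]  # (x, y) centre on the room floor, in metres
    size: Tuple[float, float]      # (width, depth) footprint, in metres


@dataclass
class Room:
    # Room level: a typed room that owns its objects.
    name: str
    room_type: str
    objects: List[SceneObject] = field(default_factory=list)


@dataclass
class Building:
    # Building level: rooms plus the pairs of rooms that connect to each other.
    rooms: List[Room] = field(default_factory=list)
    connections: List[Tuple[str, str]] = field(default_factory=list)

    def find_room(self, name: str) -> Room:
        for room in self.rooms:
            if room.name == name:
                return room
        raise KeyError(name)


def count_objects(building: Building) -> int:
    # Walk the hierarchy top-down: building -> rooms -> objects.
    return sum(len(room.objects) for room in building.rooms)


bedroom = Room("bedroom_1", "bedroom", [SceneObject("bed", (2.0, 1.5), (1.6, 2.0))])
kitchen = Room("kitchen_1", "kitchen")
home = Building([bedroom, kitchen], [("bedroom_1", "kitchen_1")])

# A local edit touches one node only; the rest of the hierarchy is unchanged.
home.find_room("kitchen_1").objects.append(
    SceneObject("fridge", (0.5, 0.4), (0.7, 0.7))
)
```

An explicit tree like this lets an edit be scoped to a single room or object, which is the kind of fine-grained control the pith attributes to IDSL.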

If this is right

Combined text and geometric inputs produce scenes whose layout and contents can be adjusted at fine detail without regenerating the entire model.
Multi-room environments maintain logical connections between rooms and avoid functional mismatches such as misplaced furniture types.
The added asset dataset with semantic labels raises the realism and surface consistency of rendered objects.
Applications in embodied AI and visualization gain repeatable control over scene variations from the same high-level description.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same IDSL intermediate step could support live editing where a user changes one sentence and the 3D layout updates accordingly.
Integration with simulation engines would let training agents experience consistent multi-room physics without manual scene fixes.
The hierarchical structure suggests a natural way to add temporal changes, such as moving objects between rooms while preserving overall coherence.

Load-bearing premise

That mapping inputs to IDSL and organizing the process hierarchically will by itself produce fine-grained controllability, physical consistency, and visual fidelity without added constraints or tuning.

What would settle it

A direct test that feeds contradictory text and CAD plan inputs and checks whether the output scenes violate spatial rules such as object overlaps or fail to match the specified room counts and object placements.
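A check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions, not the paper's evaluation code: the scene format (a dictionary mapping room names to axis-aligned 2D footprints in room-local coordinates) and the function names are hypothetical, and real scenes would need 3D and rotated bounding boxes.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# A footprint is (x_min, y_min, x_max, y_max) in metres; hypothetical format.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    # True only when the interiors intersect; touching edges do not count.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def count_overlaps(boxes: List[Box]) -> int:
    # Number of object pairs in one room whose footprints intersect.
    return sum(1 for a, b in combinations(boxes, 2) if boxes_overlap(a, b))


def check_scene(scene: Dict[str, List[Box]], expected_rooms: int) -> Dict[str, object]:
    overlaps = {room: count_overlaps(boxes) for room, boxes in scene.items()}
    return {
        "room_count_ok": len(scene) == expected_rooms,
        "overlaps_per_room": overlaps,
        "overlap_free": all(n == 0 for n in overlaps.values()),
    }


scene = {
    # The two bedroom boxes only touch at x = 2.0, so they do not overlap.
    "bedroom": [(0.0, 0.0, 2.0, 1.6), (2.0, 0.0, 2.5, 0.5)],
    # The two kitchen boxes share interior area.
    "kitchen": [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 1.5, 1.5)],
}
# The prompt is assumed to have asked for three rooms; the scene has two.
report = check_scene(scene, expected_rooms=3)
```

Running such a report on scenes generated from deliberately contradictory text and CAD inputs would show whether the pipeline resolves the conflict or silently violates one of the two specifications.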

Figures

Figures reproduced from arXiv: 2512.11234 by Ruihui Li, Shougao Zhang, Tianhao Zhou, Wentang Chen, Yiman Zhang.

**Figure 1.** Overview of RoomPilot. Given text instructions or CAD floor plans, RoomPilot produces controllable and structurally coherent multi-room indoor scenes. It parses inputs into an IDSL semantic hierarchy and refines layouts using self-regulating optimization before generating final 3D scenes.

**Figure 2.** RoomPilot takes either text descriptions or CAD floor plans as input, parses them into a unified IDSL representation, optimizes multi-room layouts by a self-regulating energy-based process, and retrieves appropriate assets to generate interactive 3D indoor scenes.

**Figure 3.** Single-room generation comparison. We show results for five room types under detailed text prompts. RoomPilot produces layouts with higher semantic alignment and structural coherence compared to existing methods.

**Figure 4.** Visualization of the procedural presets used for structural …

**Figure 5.** Example of relationship-aware post-placement opti…

**Figure 6.** Visual results of RoomPilot across different settings. The top row displays the detailed descriptions and corresponding 3D scenes.

Abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces RoomPilot, a unified framework for controllable indoor scene synthesis from multimodal inputs such as textual descriptions and CAD floor plans. It maps these inputs to an Indoor Domain-Specific Language (IDSL) as a structured semantic representation and employs a hierarchical synthesis pipeline that organizes scenes at building, room, and object levels. Additionally, it constructs a curated asset dataset with rich semantic annotations to support high-quality synthesis. The authors claim that this leads to effective multi-modal understanding, fine-grained controllability, structural coherence, functional consistency, physical consistency, and improved visual fidelity, as shown through extensive experiments.

Significance. If the central claims hold, this work has the potential to advance the field of 3D indoor scene generation by introducing an interpretable intermediate representation via IDSL and a multi-scale hierarchical approach, which could address limitations in controllability and consistency found in prior implicit or limited-modality methods. The promise of releasing code, model, and the curated dataset would further enhance its impact and reproducibility.

major comments (1)

Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of RoomPilot. We address the single major comment below and will incorporate revisions to improve clarity and substantiation of claims.

Referee: Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.

Authors: We agree that the abstract is high-level and omits specific numbers to remain concise. The full paper (Section 4) contains quantitative results including baseline comparisons on metrics such as FID, structural coherence, physical consistency scores, and user studies for controllability and fidelity. To directly address this, we will revise the abstract to include key highlights such as representative improvements (e.g., X% better coherence and Y% higher physical consistency versus baselines). This change will better ground the claims while preserving abstract length.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The abstract introduces IDSL as a new structured semantic representation and a hierarchical synthesis pipeline at building/room/object levels as novel components for handling multimodal inputs. No equations, fitted parameters, predictions, or self-citations are shown that reduce any claimed result to its own inputs by construction. The framework is presented as self-contained through the definition of new elements rather than any self-definitional, fitted-input, or ansatz-smuggled steps, consistent with the reader's assessment of low circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the introduction of IDSL as a bridging representation and the domain assumption that hierarchical organization improves coherence; no explicit free parameters are mentioned.

axioms (2)

Domain assumption: Heterogeneous inputs such as text and CAD floor plans can be mapped to a common structured semantic representation. This mapping is presented as the foundation for the unified framework.
Domain assumption: Organizing scene synthesis hierarchically at building, room, and object levels promotes structural coherence and functional consistency. Invoked directly in the description of the synthesis pipeline.

invented entities (1)

IDSL (Indoor Domain-Specific Language) · no independent evidence
Purpose: structured and interpretable semantic representation to handle multimodal inputs for scene description. Newly introduced as the core intermediate representation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: IDSL defines a three-level semantic hierarchy (Building Level, Room Level, and Object Level), each deliberately designed according to architectural principles.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We define a time-dependent total energy E(S, t) = α_struct(t) E_struct(S) + α_sem(t) E_sem(S)
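The quoted energy is a weighted sum whose weights change over optimization time. The sketch below shows one way to evaluate it in Python. Only the overall form E(S, t) = α_struct(t) E_struct(S) + α_sem(t) E_sem(S) comes from the quoted passage. The linear schedules, their endpoints, the step count, and the function names are assumptions for illustration, since the page does not specify how the paper sets α_struct(t) and α_sem(t).

```python
from typing import Callable


def linear_schedule(start: float, end: float, total_steps: int) -> Callable[[int], float]:
    # Hypothetical weight schedule: interpolate linearly from start to end.
    def weight(t: int) -> float:
        frac = min(max(t / total_steps, 0.0), 1.0)
        return start + (end - start) * frac
    return weight


def total_energy(
    e_struct: float,
    e_sem: float,
    t: int,
    alpha_struct: Callable[[int], float],
    alpha_sem: Callable[[int], float],
) -> float:
    # E(S, t) = alpha_struct(t) * E_struct(S) + alpha_sem(t) * E_sem(S)
    return alpha_struct(t) * e_struct + alpha_sem(t) * e_sem


# Assumed schedules: emphasise structural terms early and semantic terms late.
alpha_struct = linear_schedule(1.0, 0.5, total_steps=100)
alpha_sem = linear_schedule(0.0, 1.0, total_steps=100)

# Same scene energies, evaluated at the start and at the end of the schedule.
early = total_energy(4.0, 3.0, t=0, alpha_struct=alpha_struct, alpha_sem=alpha_sem)
late = total_energy(4.0, 3.0, t=100, alpha_struct=alpha_struct, alpha_sem=alpha_sem)
```

With these assumed schedules the same scene scores 4.0 at t = 0 and 5.0 at t = 100, because the semantic term only starts to count once its weight has ramped up.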

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

