RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing
Pith reviewed 2026-05-21 16:48 UTC · model grok-4.3
The pith
RoomPilot maps text descriptions and CAD floor plans into an Indoor Domain-Specific Language that drives hierarchical synthesis of controllable multi-room 3D scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoomPilot shows that heterogeneous multimodal inputs can be translated into an Indoor Domain-Specific Language (IDSL) that acts as an interpretable semantic representation, supporting a hierarchical synthesis pipeline that builds scenes progressively at the building, room, and object levels to promote structural coherence and functional consistency across multi-room layouts.
What carries the argument
The Indoor Domain-Specific Language (IDSL), which unifies the inputs into one structured semantic representation and lets the pipeline organize a scene progressively at the building, room, and object scales.
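To make the three-level organization concrete, here is a minimal Python sketch of what an IDSL-like structure could look like. The paper's actual IDSL syntax is not shown on this page, so every class and field name below (`Building`, `Room`, `SceneObject`, `category`, `position`, `connections`) is a hypothetical stand-in, not the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    # Object level: one furniture item (all fields are illustrative assumptions).
    category: str
    position: Tuple[float, float]  # (x, y) in metres, room-local


@dataclass
class Room:
    # Room level: a named room holding its objects.
    name: str
    room_type: str
    objects: List[SceneObject] = field(default_factory=list)


@dataclass
class Building:
    # Building level: rooms plus which rooms connect to each other.
    rooms: List[Room] = field(default_factory=list)
    connections: List[Tuple[str, str]] = field(default_factory=list)

    def find_room(self, name: str) -> Room:
        for room in self.rooms:
            if room.name == name:
                return room
        raise KeyError(name)


def count_objects(building: Building) -> int:
    # Walk the hierarchy top-down: building -> rooms -> objects.
    return sum(len(room.objects) for room in building.rooms)


bedroom = Room("bedroom_1", "bedroom", [SceneObject("bed", (1.0, 2.0))])
kitchen = Room("kitchen_1", "kitchen", [SceneObject("table", (0.5, 0.5)),
                                        SceneObject("chair", (0.5, 1.2))])
home = Building(rooms=[bedroom, kitchen],
                connections=[("bedroom_1", "kitchen_1")])

# A local edit touches one object without rebuilding the rest of the scene.
home.find_room("kitchen_1").objects[1].position = (0.9, 1.2)
```

An explicit tree like this is what would make a representation inspectable: a single object can be changed while the building and room levels stay as they were.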
If this is right
- Combined text and geometric inputs produce scenes whose layout and contents can be adjusted at fine detail without regenerating the entire model.
- Multi-room environments maintain logical connections between rooms and avoid functional mismatches such as misplaced furniture types.
- The added asset dataset with semantic labels raises the realism and surface consistency of rendered objects.
- Applications in embodied AI and visualization gain repeatable control over scene variations from the same high-level description.
Where Pith is reading between the lines
- The same IDSL intermediate step could support live editing where a user changes one sentence and the 3D layout updates accordingly.
- Integration with simulation engines would let training agents experience consistent multi-room physics without manual scene fixes.
- The hierarchical structure suggests a natural way to add temporal changes, such as moving objects between rooms while preserving overall coherence.
Load-bearing premise
That mapping inputs to IDSL and organizing the process hierarchically will, by themselves, produce fine-grained controllability, physical consistency, and visual fidelity without added constraints or tuning.
What would settle it
A direct test that feeds contradictory text and CAD plan inputs and checks whether the output scenes violate spatial rules such as object overlaps or fail to match the specified room counts and object placements.
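The check described above can be sketched in a few lines of Python. This is an illustrative validator, not the paper's evaluation code: it assumes each object is reduced to an axis-aligned 2D footprint, and the names `boxes_overlap`, `find_violations`, and the `scene` layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Axis-aligned footprint: (x_min, y_min, x_max, y_max) in metres.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    # Two rectangles overlap only if they intersect on both axes;
    # touching edges do not count as an overlap.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_violations(scene: Dict[str, List[Box]], expected_rooms: int) -> List[str]:
    """Return human-readable rule violations for a generated scene.

    `scene` maps a room name to the footprints of the objects placed in it.
    """
    violations = []
    if len(scene) != expected_rooms:
        violations.append(
            f"room count {len(scene)} != expected {expected_rooms}")
    for room, boxes in scene.items():
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_overlap(boxes[i], boxes[j]):
                    violations.append(f"{room}: objects {i} and {j} overlap")
    return violations


scene = {
    "bedroom": [(0.0, 0.0, 2.0, 1.5), (1.5, 1.0, 2.5, 2.0)],  # these two collide
    "kitchen": [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0)],  # these only touch
}
report = find_violations(scene, expected_rooms=3)
```

Here `report` lists one room-count mismatch and one overlap in the bedroom. A real test would also need 3D extents, rotations, and a check that each requested object actually appears in the requested room.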
Original abstract
Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoomPilot, a unified framework for controllable indoor scene synthesis from multimodal inputs such as textual descriptions and CAD floor plans. It maps these inputs to an Indoor Domain-Specific Language (IDSL) as a structured semantic representation and employs a hierarchical synthesis pipeline that organizes scenes at building, room, and object levels. Additionally, it constructs a curated asset dataset with rich semantic annotations to support high-quality synthesis. The authors claim that this leads to effective multi-modal understanding, fine-grained controllability, structural coherence, functional consistency, physical consistency, and improved visual fidelity, as shown through extensive experiments.
Significance. If the central claims hold, this work has the potential to advance the field of 3D indoor scene generation by introducing an interpretable intermediate representation via IDSL and a multi-scale hierarchical approach, which could address limitations in controllability and consistency found in prior implicit or limited-modality methods. The promise of releasing code, model, and the curated dataset would further enhance its impact and reproducibility.
major comments (1)
- Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of RoomPilot. We address the single major comment below and will incorporate revisions to improve clarity and substantiation of claims.
- Referee: Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.
Authors: We agree that the abstract is high-level and omits specific numbers to remain concise. The full paper (Section 4) contains quantitative results including baseline comparisons on metrics such as FID, structural coherence, physical consistency scores, and user studies for controllability and fidelity. To directly address this, we will revise the abstract to include key highlights such as representative improvements (e.g., X% better coherence and Y% higher physical consistency versus baselines). This change will better ground the claims while preserving abstract length.
Revision: yes
Circularity Check
No significant circularity in derivation chain
The abstract introduces IDSL as a new structured semantic representation and a hierarchical synthesis pipeline at building/room/object levels as novel components for handling multimodal inputs. No equations, fitted parameters, predictions, or self-citations are shown that reduce any claimed result to its own inputs by construction. The framework is presented as self-contained through the definition of new elements rather than any self-definitional, fitted-input, or ansatz-smuggled steps, consistent with the reader's assessment of low circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Heterogeneous inputs such as text and CAD floor plans can be mapped to a common structured semantic representation.
- domain assumption: Organizing scene synthesis hierarchically at building, room, and object levels promotes structural coherence and functional consistency.
invented entities (1)
- IDSL (Indoor Domain-Specific Language): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
IDSL defines a three-level semantic hierarchy—Building Level, Room Level, and Object Level—each deliberately designed according to architectural principles.
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We define a time-dependent total energy E(S, t) = α_struct(t) E_struct(S) + α_sem(t) E_sem(S)
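The quoted formula blends a structural and a semantic energy term with weights that change over time. As a worked illustration only, the Python sketch below evaluates it for one fixed scene. The linear schedules, the normalization of `t` to [0, 1], and the direction of the shift (structure first, semantics later) are assumptions made here; this page does not state the schedules the paper uses.

```python
def alpha_struct(t: float) -> float:
    # Assumed schedule: structural weight decays linearly from 1 to 0 on t in [0, 1].
    return 1.0 - t


def alpha_sem(t: float) -> float:
    # Assumed schedule: semantic weight grows linearly from 0 to 1.
    return t


def total_energy(e_struct: float, e_sem: float, t: float) -> float:
    # E(S, t) = alpha_struct(t) * E_struct(S) + alpha_sem(t) * E_sem(S)
    if not 0.0 <= t <= 1.0:
        raise ValueError("t is assumed to be normalized to [0, 1]")
    return alpha_struct(t) * e_struct + alpha_sem(t) * e_sem


# The same scene S (fixed E_struct = 4.0, E_sem = 2.0) scored at three times.
early = total_energy(4.0, 2.0, 0.0)
middle = total_energy(4.0, 2.0, 0.5)
late = total_energy(4.0, 2.0, 1.0)
```

With these assumed schedules the score moves from 4.0 to 3.0 to 2.0 although the scene itself does not change, which shows that a time-dependent energy ranks the same layout differently at different stages of the search.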
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.