
arxiv: 2512.11234 · v2 · pith:DHDXHFTSnew · submitted 2025-12-12 · 💻 cs.CV

RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Pith reviewed 2026-05-21 16:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords indoor scene synthesis · multimodal parsing · domain-specific language · hierarchical generation · controllable 3D scenes · semantic representation · CAD floor plans

The pith

RoomPilot maps text descriptions and CAD floor plans into an Indoor Domain-Specific Language that drives hierarchical synthesis of controllable multi-room 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoomPilot to generate indoor scenes from mixed inputs such as text and floor plans while giving users precise control over the results. It first converts those varied inputs into a single structured representation called IDSL that captures semantic details clearly. A staged process then assembles the scene from the full building layout down through individual rooms to specific objects, aiming to keep everything logically connected and functionally sensible. This matters for applications like game design and architectural previews because earlier approaches either accept only narrow input forms or produce outputs that are hard to adjust precisely. The work also supplies a labeled collection of 3D assets to help the generated scenes look more realistic and consistent.

Core claim

RoomPilot shows that heterogeneous multimodal inputs can be translated into an Indoor Domain-Specific Language (IDSL) that acts as an interpretable semantic representation, supporting a hierarchical synthesis pipeline that builds scenes progressively at the building, room, and object levels to promote structural coherence and functional consistency across multi-room layouts.

What carries the argument

The Indoor Domain-Specific Language (IDSL) that unifies inputs into a structured semantic representation and enables progressive hierarchical organization at building, room, and object scales.
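
To make the building, room, and object hierarchy concrete, here is a minimal sketch of what such a structured representation might look like. The paper's actual IDSL syntax is not shown on this page, so every name below (`SceneObject`, `Room`, `Building`, `edit_object`, and all field names) is a hypothetical stand-in, not the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    # Hypothetical object-level record: a category plus a 2D footprint.
    category: str
    position: Tuple[float, float]  # (x, y) center, room-local, in meters
    size: Tuple[float, float]      # (width, depth) of the footprint


@dataclass
class Room:
    # Hypothetical room-level record holding its own objects.
    name: str
    room_type: str
    objects: List[SceneObject] = field(default_factory=list)


@dataclass
class Building:
    # Hypothetical building-level record: rooms plus which rooms connect.
    rooms: List[Room] = field(default_factory=list)
    adjacency: List[Tuple[str, str]] = field(default_factory=list)


def edit_object(building: Building, room_name: str, category: str,
                new_position: Tuple[float, float]) -> bool:
    """Move the first matching object in one room; other rooms are untouched."""
    for room in building.rooms:
        if room.name != room_name:
            continue
        for obj in room.objects:
            if obj.category == category:
                obj.position = new_position
                return True
    return False


building = Building(
    rooms=[
        Room("bedroom_1", "bedroom", [SceneObject("bed", (1.0, 2.0), (1.6, 2.0))]),
        Room("living_1", "living_room", [SceneObject("sofa", (2.0, 1.0), (2.0, 0.9))]),
    ],
    adjacency=[("bedroom_1", "living_1")],
)
```

In a structure like this, a local edit such as `edit_object(building, "bedroom_1", "bed", (1.5, 2.0))` changes one object record and leaves the rest of the hierarchy alone, which is the kind of fine-grained control an explicit intermediate representation is meant to allow.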

If this is right

  • Combined text and geometric inputs produce scenes whose layout and contents can be adjusted at fine detail without regenerating the entire model.
  • Multi-room environments maintain logical connections between rooms and avoid functional mismatches such as misplaced furniture types.
  • The added asset dataset with semantic labels raises the realism and surface consistency of rendered objects.
  • Applications in embodied AI and visualization gain repeatable control over scene variations from the same high-level description.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same IDSL intermediate step could support live editing where a user changes one sentence and the 3D layout updates accordingly.
  • Integration with simulation engines would let training agents experience consistent multi-room physics without manual scene fixes.
  • The hierarchical structure suggests a natural way to add temporal changes, such as moving objects between rooms while preserving overall coherence.

Load-bearing premise

That mapping inputs to IDSL and organizing the process hierarchically will by itself produce fine-grained controllability, physical consistency, and visual fidelity without added constraints or tuning.

What would settle it

A direct test that feeds contradictory text and CAD plan inputs and checks whether the output scenes violate spatial rules such as object overlaps or fail to match the specified room counts and object placements.
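
A check of this kind could be sketched roughly as follows. This is an illustration only: it assumes each generated object is reduced to an axis-aligned 2D footprint, and the function names and the scene format are hypothetical rather than taken from the paper or its evaluation code.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical footprint format: (x_min, y_min, x_max, y_max) in meters.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def count_overlaps(boxes: List[Box]) -> int:
    """Number of object pairs in one room whose footprints overlap."""
    return sum(1 for a, b in combinations(boxes, 2) if boxes_overlap(a, b))


def check_scene(rooms: Dict[str, List[Box]], expected_rooms: int) -> Dict[str, object]:
    """Compare the generated room count with the request and count overlaps per room."""
    return {
        "room_count_ok": len(rooms) == expected_rooms,
        "overlaps": {name: count_overlaps(boxes) for name, boxes in rooms.items()},
    }


scene = {
    "bedroom": [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)],   # these two intersect
    "kitchen": [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0)],   # these only touch
}
report = check_scene(scene, expected_rooms=2)
```

Running such a check on scenes generated from deliberately conflicting text and CAD inputs would show whether violations appear, though a real evaluation would need 3D geometry and the paper's own metrics.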

Figures

Figures reproduced from arXiv: 2512.11234 by Ruihui Li, Shougao Zhang, Tianhao Zhou, Wentang Chen, Yiman Zhang.

Figure 1: Overview of RoomPilot. Given text instructions or CAD floor plans, RoomPilot produces controllable and structurally coherent multi-room indoor scenes. It parses inputs into an IDSL semantic hierarchy and refines layouts using self-regulating optimization before generating final 3D scenes.
Figure 2: RoomPilot takes either text descriptions or CAD floor plans as input, parses them into a unified IDSL representation, optimizes multi-room layouts by a self-regulating energy-based process, and retrieves appropriate assets to generate interactive 3D indoor scenes.
Figure 3: Single-room generation comparison. We show results for five room types under detailed text prompts. RoomPilot produces layouts with higher semantic alignment and structural coherence compared to existing methods.
Figure 4: Visualization of the procedural presets used for structural …
Figure 5: Example of relationship-aware post-placement opti…
Figure 6: Visual results of RoomPilot across different settings. The top row displays the detailed descriptions and corresponding 3D scenes.
Original abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces RoomPilot, a unified framework for controllable indoor scene synthesis from multimodal inputs such as textual descriptions and CAD floor plans. It maps these inputs to an Indoor Domain-Specific Language (IDSL) as a structured semantic representation and employs a hierarchical synthesis pipeline that organizes scenes at building, room, and object levels. Additionally, it constructs a curated asset dataset with rich semantic annotations to support high-quality synthesis. The authors claim that this leads to effective multi-modal understanding, fine-grained controllability, structural coherence, functional consistency, physical consistency, and improved visual fidelity, as shown through extensive experiments.

Significance. If the central claims hold, this work has the potential to advance the field of 3D indoor scene generation by introducing an interpretable intermediate representation via IDSL and a multi-scale hierarchical approach, which could address limitations in controllability and consistency found in prior implicit or limited-modality methods. The promise of releasing code, model, and the curated dataset would further enhance its impact and reproducibility.

major comments (1)
  1. Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of RoomPilot. We address the single major comment below and will incorporate revisions to improve clarity and substantiation of claims.

Point-by-point responses
  1. Referee: Abstract: The abstract asserts that 'Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity' but supplies no quantitative metrics, baseline comparisons, error analysis, or experimental details. This absence is load-bearing for the central claims of improvements over existing approaches.

    Authors: We agree that the abstract is high-level and omits specific numbers to remain concise. The full paper (Section 4) contains quantitative results including baseline comparisons on metrics such as FID, structural coherence, physical consistency scores, and user studies for controllability and fidelity. To directly address this, we will revise the abstract to include key highlights such as representative improvements (e.g., X% better coherence and Y% higher physical consistency versus baselines). This change will better ground the claims while preserving abstract length.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The abstract introduces IDSL as a new structured semantic representation and a hierarchical synthesis pipeline at building/room/object levels as novel components for handling multimodal inputs. No equations, fitted parameters, predictions, or self-citations are shown that reduce any claimed result to its own inputs by construction. The framework is presented as self-contained through the definition of new elements rather than any self-definitional, fitted-input, or ansatz-smuggled steps, consistent with the reader's assessment of low circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the introduction of IDSL as a bridging representation and the domain assumption that hierarchical organization improves coherence; no explicit free parameters are mentioned.

axioms (2)
  • domain assumption: Heterogeneous inputs such as text and CAD floor plans can be mapped to a common structured semantic representation.
    This mapping is presented as the foundation for the unified framework.
  • domain assumption: Organizing scene synthesis hierarchically at building, room, and object levels promotes structural coherence and functional consistency.
    Invoked directly in the description of the synthesis pipeline.
invented entities (1)
  • IDSL (Indoor Domain-Specific Language) · no independent evidence
    purpose: Structured and interpretable semantic representation to handle multimodal inputs for scene description
    Newly introduced as the core intermediate representation.

pith-pipeline@v0.9.0 · 5727 in / 1518 out tokens · 66456 ms · 2026-05-21T16:48:52.762928+00:00 · methodology
