Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis
Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3
The pith
A multi-stage MLLM agent parses top-down room images and outputs executable Blender code to build 3D scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline equipped with a cross-stage memory module and structured execution harness.
What carries the argument
The multi-stage agentic pipeline that maintains cross-stage memory while generating Blender code from parsed image elements and spatial relations.
If this is right
- The generated code runs directly in Blender to produce complete, renderable 3D rooms.
- The execution harness and memory module prevent the infinite loops and instability reported for earlier image-conditioned agents.
- A new benchmark supplies standard evaluation protocols for code-based room synthesis methods.
- Precise control over geometry, materials, and lighting becomes possible because the output is editable source code.
Where Pith is reading between the lines
- The editable nature of the code could let users iteratively refine scenes by editing the script rather than restarting from the image.
- The same staged parsing-plus-code approach might transfer to generating 3D models of outdoor scenes or non-room interiors from different reference views.
- Because the code is executable, it could be combined with simulation engines to test embodied AI agents inside the synthesized rooms without additional modeling steps.
Load-bearing premise
The multimodal model can extract accurate object identities and spatial relationships from one top-down image without omissions or errors that would make the generated code fail to match the scene.
What would settle it
Execute the output Blender code on a held-out set of top-down images and measure whether the resulting 3D models, when viewed from matching angles, reproduce the original object counts, positions, and approximate sizes within a small tolerance.
Figures
read the original abstract
Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Code-as-Room, an MLLM-based agentic framework that generates 3D indoor rooms from top-down view images by synthesizing executable Blender code. It parses the reference image to extract scene elements and spatial relationships, then uses a multi-stage pipeline for geometry, materials, and lighting synthesis, augmented by a cross-stage memory module to mitigate context forgetting. The work also introduces a dedicated benchmark for code-based 3D room synthesis and conducts comparisons against existing agent-based methods.
Significance. If the central claims hold, the framework could improve stability and precision in image-conditioned 3D room generation compared to prior MLLM agents, offering advantages in interpretability through code output and applicability to interior design, VR, gaming, and embodied AI. The introduction of a structured execution harness and a new benchmark provides concrete tools for future work in this area.
major comments (2)
- [Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.
- [Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.
minor comments (2)
- [Abstract] The abstract states that comparisons validate the harness but does not preview any key quantitative metrics or error rates, which would strengthen the summary for readers.
- [Method] Clarify the exact interface between the cross-stage memory module and the code execution environment to avoid ambiguity in how context is preserved across stages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [Method / Pipeline Description] The pipeline's first stage (image parsing for scene elements and spatial relationships) is load-bearing for all downstream code synthesis, yet the manuscript provides no explicit verification, rollback, or quantitative metrics on extraction accuracy; top-down views lack depth and occlusion cues, and MLLM hallucinations here would directly invalidate later stages without correction.
Authors: We agree that the image parsing stage is foundational and that top-down views present inherent challenges due to missing depth and occlusion information. The manuscript highlights the multi-stage pipeline and cross-stage memory module as mechanisms to maintain consistency and reduce error propagation across stages. However, we acknowledge that explicit verification, rollback procedures, and quantitative metrics for parsing accuracy are not reported in the current version. In the revised manuscript, we will add a new subsection under the method that includes human-evaluated parsing accuracy metrics (e.g., element detection precision and spatial relationship correctness on a subset of the benchmark) along with examples of how the agentic framework can detect and mitigate hallucinations through iterative code refinement. revision: yes
-
Referee: [Experiments / Benchmark] The new benchmark is described as enabling comprehensive comparisons, but the evaluation section does not isolate parsing fidelity from end-to-end results or report ablations on the execution harness; this leaves unclear whether claimed improvements over baselines stem from better extraction or merely from the harness preventing infinite loops.
Authors: We thank the referee for this observation. The benchmark and evaluation protocols focus on end-to-end metrics such as code executability, visual fidelity, and functional correctness to enable fair comparisons with prior agent-based methods. We recognize that additional component-level analysis would strengthen the claims. In the revised manuscript, we will expand the experiments section with ablations that isolate the execution harness (comparing runs with and without it to quantify stability gains) and include proxy metrics for parsing fidelity where feasible, such as correlation between parsing quality and final room quality scores. This will clarify the sources of the reported improvements. revision: yes
Circularity Check
No circularity: engineering framework with no derivations or fitted predictions
full rationale
The paper presents Code-as-Room as an MLLM-based agentic framework that parses top-down images and synthesizes Blender code via a multi-stage pipeline with a cross-stage memory module. No equations, parameters, or first-principles derivations appear in the abstract or description. The approach is an explicit engineering construction whose claims are evaluated on a new benchmark rather than reducing to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are invoked. The central pipeline is externally falsifiable through execution and benchmark metrics, making the derivation self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
System card:claude opus 4.6.Anthropic Technical Report, 2026
Anthropic. System card:claude opus 4.6.Anthropic Technical Report, 2026. URLhttps://www-cdn. anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf. Accessed: 2026-05-12
work page 2026
-
[2]
3d semantic parsing of large-scale indoor spaces
Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016
work page 2016
-
[3]
A generalized semantic representation for procedural generation of rooms
J Timothy Balint and Rafael Bidarra. A generalized semantic representation for procedural generation of rooms. InProceedings of the 14th International Conference on the F oundations of Digital Games, pages 1–8, 2019
work page 2019
-
[4]
I- design: Personalized llm interior designer
Ata C ¸ elen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I- design: Personalized llm interior designer. InEuropean Conference on Computer Vision, pages 217–234. Springer, 2024
work page 2024
-
[5]
Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems, 35:5982–5994, 2022
work page 2022
-
[6]
Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems, 36:18225–18250, 2023
work page 2023
-
[7]
Anyhome: Open-vocabulary generation of struc- tured and textured 3d homes
Rao Fu, Zehao Wen, Zichen Liu, and Srinath Sridhar. Anyhome: Open-vocabulary generation of struc- tured and textured 3d homes. InEuropean Conference on Computer Vision, pages 52–70. Springer, 2024
work page 2024
-
[8]
Gemini 3.1 flash-lite model card.Google DeepMind Techni- cal Report, 2026
Google Gemini Team. Gemini 3.1 flash-lite model card.Google DeepMind Techni- cal Report, 2026. URLhttps://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Flash-Lite-Model-Card.pdf. Accessed: 2026-05-12
work page 2026
-
[9]
Gemini 3.1 pro model card.Google DeepMind Technical Re- port, 2026
Google Gemini Team. Gemini 3.1 pro model card.Google DeepMind Technical Re- port, 2026. URLhttps://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-05-12
work page 2026
-
[10]
R Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J Mitra, and Daniel Ritchie. Shapeassembly: Learning to generate programs for 3d shape structure synthesis.ACM Transactions on Graphics (TOG), 39(6):1–20, 2020
work page 2020
-
[11]
InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset
Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset.arXiv preprint arXiv:1809.00716, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[12]
Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Meng Song, Yuhan Liu, Yu-Ying Yeh, Rui Zhu, Nitesh Gundavarapu, Jia Shi, et al. Openrooms: An end-to-end open framework for photorealistic indoor scene datasets.arXiv preprint arXiv:2007.12868, 2020
-
[13]
Zero- 1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero- 1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023
work page 2023
-
[14]
Sync- dreamer: Generating multiview-consistent images from a single-view image
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Sync- dreamer: Generating multiview-consistent images from a single-view image. InInternational conference on learning representations, volume 2024, pages 27676–27697, 2024
work page 2024
-
[15]
Ll3m: Large language 3d modelers.arXiv preprint arXiv:2508.08228, 2025
Sining Lu, Guan Chen, Nam Anh Dinh, Itai Lang, Ari Holtzman, and Rana Hanocka. Ll3m: Large language 3d modelers.arXiv preprint arXiv:2508.08228, 2025
-
[16]
Stable: Simulation-ready tabletop layout generation via a semantics-physics dual system,
Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, and Yanwei Fu. Stable: Simulation-ready tabletop layout generation via a semantics-physics dual system,
-
[17]
URLhttps://arxiv.org/abs/2605.16137
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala, and Vladlen Koltun. Interactive furniture layout using interior design guidelines.ACM transactions on graphics (TOG), 30(4):1–10, 2011
work page 2011
-
[19]
Gpt-5.5 model card and system card.OpenAI Technical Report, 2026
OpenAI. Gpt-5.5 model card and system card.OpenAI Technical Report, 2026. URLhttps: //deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf. Accessed: 2026-05-12. 13
work page 2026
-
[20]
arXiv preprint arXiv:2602.09153 , year=
Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, and Russ Tedrake. Scenesmith: Agentic generation of simulation-ready indoor scenes.arXiv preprint arXiv:2602.09153, 2026
-
[21]
Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026
Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026. URLhttps: //qwen.ai/blog?id=qwen3.6-27b
work page 2026
-
[22]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. InEuropean conference on computer vision, pages 746–760. Springer, 2012
work page 2012
-
[23]
3d-gpt: Proce- dural 3d modeling with large language models
Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, and Stephen Gould. 3d-gpt: Proce- dural 3d modeling with large language models. In2025 International Conference on 3D Vision (3DV), pages 1253–1263. IEEE, 2025
work page 2025
-
[24]
Layoutvlm: Differentiable optimization of 3d layout via vision-language models
Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025
work page 2025
-
[25]
Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, et al. 3d-generalist: Self-improving vision-language-action models for crafting 3d worlds.arXiv preprint arXiv:2507.06484, 2025
-
[26]
arXiv preprint arXiv:2602.10116 , year=
Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei- Chiu Ma, Shenlong Wang, et al. Sage: Scalable agentic 3d scene generation for embodied ai.arXiv preprint arXiv:2602.10116, 2026
-
[27]
Constraint-based automatic placement for scene composition
Ken Xu, James Stewart, and Eugene Fiume. Constraint-based automatic placement for scene composition. InGraphics Interface, volume 2, pages 25–34, 2002
work page 2002
-
[28]
Yandan Yang, Baoxiong Jia, Shujie Zhang, and Siyuan Huang. Sceneweaver: All-in-one 3d scene syn- thesis with an extensible and self-reflective agent.Advances in neural information processing systems, 38:140319–140351, 2026
work page 2026
-
[29]
Sceneweaver: All-in-one 3d scene synthesis with an extensible and self- reflective agent
Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model.arXiv preprint arXiv:2406.03866, 2024
-
[30]
Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, and Feng Zheng. Optiscene: Llm-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization.Advances in Neural Information Processing Systems, 38:42499– 42529, 2026
work page 2026
-
[31]
Holodeck: Language guided genera- tion of 3d embodied AI environments
Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Kiana Ehsani, and Eric Kolve. Holodeck: Language guided genera- tion of 3d embodied AI environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[32]
Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Transactions on Graphics (TOG), 44(4):1–19, 2025
work page 2025
-
[33]
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J Black, Trevor Dar- rell, Angjoo Kanazawa, and Haiwen Feng. Vision-as-inverse-graphics agent via interleaved multimodal reasoning.arXiv preprint arXiv:2601.11109, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Make it home: Automatic optimization of furniture arrangement.ACM Trans
Lap-Fai Yu, Sai Kit Yeung, Chi-Keung Tang, Demetri Terzopoulos, Tony F Chan, and Stanley J Osher. Make it home: Automatic optimization of furniture arrangement.ACM Trans. Graph., 30(4):86, 2011
work page 2011
-
[35]
Lap-Fai Yu, Sai-Kit Yeung, and Demetri Terzopoulos. The clutterpalette: An interactive tool for detailing indoor scenes.IEEE transactions on visualization and computer graphics, 22(2):1138–1148, 2015. 14
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.