SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3
The pith
SceneCode turns natural language prompts into executable Blender programs that produce editable indoor scenes with articulated objects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A room-level agentic backbone converts a prompt into a house layout and per-object AssetRequests; each request is routed to one of five code-generation strategies that output part-wise Blender Python programs; an execution-guided repair-and-refine loop validates the programs; the resulting programs compile into simulation-ready assets with articulation metadata and are linked through a persistent scene-state registry that supports traceable editing.
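The routing step in this claim can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `AssetRequest` fields, the five strategy names, and the routing rule are hypothetical placeholders, since this summary does not name the strategies or give the request schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AssetRequest:
    # Hypothetical fields; the actual schema is not given in this summary.
    object_id: str
    category: str
    articulated: bool


def make_strategy(name: str) -> Callable[[AssetRequest], str]:
    """Build a stand-in generator that returns program text for a request."""
    def generate(request: AssetRequest) -> str:
        # A real strategy would emit a part-wise Blender Python program.
        return f"# strategy: {name}\n# parts for: {request.object_id}\n"
    return generate


# Five placeholder strategies; the names are invented for this sketch.
STRATEGIES: Dict[str, Callable[[AssetRequest], str]] = {
    name: make_strategy(name)
    for name in ("strategy_1", "strategy_2", "strategy_3", "strategy_4", "strategy_5")
}


def route(request: AssetRequest) -> str:
    """Toy routing rule (assumed): articulated objects go to strategy_1."""
    return "strategy_1" if request.articulated else "strategy_2"


request = AssetRequest(object_id="cabinet_01", category="cabinet", articulated=True)
program = STRATEGIES[route(request)](request)
print(program)
```

Keeping the strategies in a dictionary keyed by name means the routing decision is a plain string that could be logged alongside the request, which fits the traceability goal described above.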
What carries the argument
The SceneCode pipeline, which routes each AssetRequest to a code-generation strategy and applies an execution-guided repair-and-refine loop to produce valid part-wise Blender programs for articulated objects.
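An execution-guided repair loop of this kind might look like the sketch below. It is an assumption-laden stand-in: plain `exec` replaces running the program inside Blender, and the `repair` callback (hypothetical) replaces whatever model call the paper uses to rewrite a failing program from its error trace.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_program(source: str) -> Optional[str]:
    """Execute program text; return None on success or the traceback on failure."""
    try:
        exec(compile(source, "<asset_program>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc()


def repair_loop(
    source: str,
    repair: Callable[[str, str], str],
    max_iters: int = 3,
) -> Tuple[str, bool, int]:
    """Run, and on failure ask `repair` for a new version, up to max_iters times.

    Returns (final_source, succeeded, number_of_repairs_applied).
    """
    for attempt in range(max_iters + 1):
        error = run_program(source)
        if error is None:
            return source, True, attempt
        if attempt == max_iters:
            break
        # Hypothetical repair step: in practice this would be a model call
        # that sees both the program and the execution error.
        source = repair(source, error)
    return source, False, max_iters


def toy_repair(source: str, error: str) -> str:
    """Stand-in repairer that fixes one known typo."""
    return source.replace("hieght", "height")


broken = "height = 2\narea = 3 * hieght\n"
fixed, ok, repairs = repair_loop(broken, toy_repair)
print(ok, repairs)
```

The loop returns the repair count as well as the outcome, so iteration statistics can be collected without extra instrumentation.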
If this is right
- Generated scenes match the original prompt more faithfully than static-mesh pipelines.
- Produced assets exhibit cleaner mesh structure than library-sourced alternatives.
- Assets carry simulator-loadable articulation metadata that supports direct use in physics engines.
- Scene assembly becomes a traceable process in which individual objects can be edited locally without regenerating the entire environment.
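The local-editing consequence in the last bullet can be illustrated with a small registry sketch. The class and field names here are hypothetical; the source only states that the registry links requests, programs, rendered geometry, and simulation assets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegistryEntry:
    request: str      # per-object request text
    program: str      # executable program source
    asset_path: str   # exported simulation asset (path is illustrative)
    history: List[str] = field(default_factory=list)  # earlier program versions


class SceneRegistry:
    """Toy persistent-state stand-in: maps object ids to linked artifacts."""

    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(self, object_id: str, entry: RegistryEntry) -> None:
        self._entries[object_id] = entry

    def get(self, object_id: str) -> RegistryEntry:
        return self._entries[object_id]

    def edit_program(self, object_id: str, new_program: str) -> List[str]:
        """Replace one object's program and return the ids that need a rebuild."""
        entry = self._entries[object_id]
        entry.history.append(entry.program)  # keep the old version for traceability
        entry.program = new_program
        return [object_id]  # only the edited object is affected


registry = SceneRegistry()
registry.register("sofa_01", RegistryEntry("a grey sofa", "# sofa v1", "sofa_01.sdf"))
registry.register("lamp_01", RegistryEntry("a floor lamp", "# lamp v1", "lamp_01.sdf"))
to_rebuild = registry.edit_program("sofa_01", "# sofa v2")
print(to_rebuild)
```

The point of the sketch is that an edit touches one entry and leaves every other object's program and asset untouched.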
Where Pith is reading between the lines
- The same routing-plus-repair mechanism could be tested on prompts that introduce previously unseen object categories to measure generalization beyond the training distribution of indoor furniture.
- Because programs remain human-readable, natural-language edits could be mapped back to targeted code changes, offering a path toward interactive scene refinement that the current evaluation does not yet demonstrate.
- The persistent registry that links requests, programs, and assets may simplify integration with reinforcement-learning pipelines that require repeatable scene variations.
Load-bearing premise
The method will reliably produce valid Blender programs for arbitrary articulated indoor objects when each request is routed to one of five code-generation strategies followed by an execution-guided repair loop.
What would settle it
Run the system on a prompt such as 'a living room containing a sofa whose seat cushion lifts independently' and check whether the exported program loads into a physics simulator with correct joint articulation and without mesh errors.
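Part of this check can be automated before loading anything into a simulator. The sketch below is a structural pre-check only, under the assumption that the export is a standard SDF XML file; the sample document and the names `sofa`, `base`, `seat_cushion`, and `cushion_hinge` are invented for illustration. It does not replace loading the asset into a physics engine.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE_SDF = """
<sdf version="1.7">
  <model name="sofa">
    <link name="base"/>
    <link name="seat_cushion"/>
    <joint name="cushion_hinge" type="revolute">
      <parent>base</parent>
      <child>seat_cushion</child>
      <axis>
        <xyz>1 0 0</xyz>
        <limit><lower>0.0</lower><upper>1.2</upper></limit>
      </axis>
    </joint>
  </model>
</sdf>
"""


def check_sdf_joints(sdf_text: str) -> List[str]:
    """Return a list of structural problems found in the joints of an SDF document."""
    problems: List[str] = []
    root = ET.fromstring(sdf_text)
    for model in root.iter("model"):
        links = {link.get("name") for link in model.findall("link")}
        joints = model.findall("joint")
        if not joints:
            problems.append(f"model {model.get('name')!r} has no joints")
        for joint in joints:
            name = joint.get("name")
            for role in ("parent", "child"):
                ref = (joint.findtext(role) or "").strip()
                # "world" is accepted as a parent frame in this sketch.
                allowed = links | {"world"} if role == "parent" else links
                if ref not in allowed:
                    problems.append(f"joint {name!r}: unknown {role} {ref!r}")
            lower = joint.findtext("axis/limit/lower")
            upper = joint.findtext("axis/limit/upper")
            # Only compare limits when both are present.
            if lower is not None and upper is not None and float(lower) > float(upper):
                problems.append(f"joint {name!r}: lower limit exceeds upper limit")
    return problems


print(check_sdf_joints(SAMPLE_SDF))
```

A clean result here says only that joints reference existing links and have ordered limits; whether the cushion actually lifts as intended still has to be verified in simulation.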
Original abstract
Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SceneCode, a framework that compiles natural language prompts into executable Blender Python programs for generating editable indoor scenes with articulated objects. A room-level agentic backbone produces structured layouts and per-object AssetRequests via a planner-designer-critic loop. Each request routes to one of five code-generation strategies, yielding part-wise programs that undergo an execution-guided repair-and-refine loop. Programs compile to simulation-ready SDF assets with articulation metadata, supported by a persistent scene-state registry for traceability and local editability. Evaluations across scene synthesis, asset quality, human judgment, and robot interaction claim improvements in prompt faithfulness, mesh cleanliness, and simulator-loadable articulations.
Significance. If the empirical results hold, this work offers a meaningful advance for embodied AI and robotics by shifting from static mesh libraries to on-demand, programmatically editable and articulated scenes. The combination of agentic planning, code synthesis with self-correction, and a traceable registry addresses key limitations in controllability and reproducibility for simulation-based policy evaluation.
major comments (2)
- [§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.
- [§4] Table or figure reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.
minor comments (3)
- [§3] Notation for AssetRequest and the persistent registry could be formalized with a small diagram or pseudocode to clarify data flow between planner, code generators, and simulator export.
- [Figures 3-5] Figure captions for rendered scenes and SDF assets should explicitly note which objects were synthesized versus retrieved to highlight the on-demand generation contribution.
- [Abstract] The abstract's summary of results would benefit from one or two concrete metric improvements (e.g., 'X% higher prompt faithfulness') to better orient readers before the detailed evaluation.
Simulated Author's Rebuttal
We thank the referee for their thorough review and positive recommendation for minor revision. We appreciate the constructive feedback on strengthening the empirical validation of our framework. Below, we address each major comment point by point, outlining the revisions we will make to the manuscript.
- Referee: [§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.
  Authors: We agree with the referee that quantitative evaluation of the repair-and-refine loop is essential to support the scalability of our approach. The manuscript currently focuses on the design and qualitative demonstration of the loop in §3.2. To address this, we will add a new subsection or appendix with results from additional experiments on a held-out test set of 100 complex articulated objects, including multi-joint furniture. This will report success rates (e.g., percentage of programs that execute without errors after repair), average number of iterations required, and categorized failure modes (such as syntax errors, geometry inconsistencies, or articulation mismatches). These metrics will substantiate the reliability of the five-strategy routing and self-correction mechanism.
  Revision: yes
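The metrics promised in this response could be computed from per-object repair logs along the following lines. The record layout is hypothetical and the four records are made-up placeholders used only to exercise the code; they are not results from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RepairRecord:
    object_id: str
    succeeded: bool
    iterations: int                      # repair iterations used
    failure_mode: Optional[str] = None   # set only when the loop gave up


def summarize(records: List[RepairRecord]) -> Dict[str, object]:
    """Compute success rate, mean iterations on success, and failure-mode counts."""
    if not records:
        raise ValueError("no records to summarize")
    successes = [r for r in records if r.succeeded]
    failures = [r for r in records if not r.succeeded]
    mean_iters = (
        sum(r.iterations for r in successes) / len(successes) if successes else None
    )
    return {
        "success_rate": len(successes) / len(records),
        "mean_iterations_on_success": mean_iters,
        "failure_modes": dict(Counter(r.failure_mode for r in failures)),
    }


# Placeholder records, invented for illustration only.
records = [
    RepairRecord("cabinet_01", True, 1),
    RepairRecord("drawer_01", True, 3),
    RepairRecord("lamp_01", True, 0),
    RepairRecord("fridge_01", False, 5, "articulation mismatch"),
]
summary = summarize(records)
print(summary)
```

Reporting mean iterations over successful runs only avoids mixing in runs that hit the iteration cap, which would otherwise pull the average toward the cap.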
- Referee: [§4] Table or figure reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.
  Authors: We acknowledge that the current presentation of results in §4 could be strengthened by more explicit comparisons and statistical rigor. While our evaluations include comparisons to prior methods for scene synthesis and asset quality, we will revise the tables and figures to clearly identify all baselines used, include statistical significance tests (e.g., paired t-tests with p-values), and add error analysis such as standard deviations, confidence intervals, and discussion of outlier cases. This will enable readers to fully assess the improvements claimed over existing pipelines.
  Revision: yes
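For reference, the paired t-test mentioned in this response compares two methods on the same prompts by testing whether the mean per-prompt difference is zero. The sketch below computes the statistic with the standard library only; the scores are invented placeholders, not numbers from the paper, and the function name is an assumption of this example.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-prompt scores, invented for illustration only.
method_scores = [0.8, 0.7, 0.9, 0.6]
baseline_scores = [0.7, 0.6, 0.7, 0.5]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(round(t_stat, 3), dof)
```

The p-value then comes from Student's t distribution with the returned degrees of freedom; a library routine such as `scipy.stats.ttest_rel` would report both the statistic and the p-value in one call.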
Circularity Check
No significant circularity detected
The paper describes a new pipeline for programmatic indoor scene synthesis that routes AssetRequests through five code-generation strategies and an execution-guided repair loop to produce Blender Python programs, which are then compiled to SDF assets. This construction relies on standard agentic planning, self-correction loops, and external tools (Blender, SDF export) rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. Evaluations on scene synthesis, asset quality, and robot interaction are presented as independent benchmarks, leaving the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Natural language prompts can be reliably turned into structured house layouts and per-object AssetRequests via a planner-designer-critic loop
- domain assumption Blender Python programs generated by the five strategies can be executed and repaired to produce valid articulated assets
invented entities (2)
- AssetRequest: no independent evidence
- persistent scene-state registry: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.