
arxiv: 2605.19587 · v1 · pith:3I2BWIGJnew · submitted 2026-05-19 · 💻 cs.AI

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords indoor scene synthesis · executable programs · articulated objects · Blender Python · embodied AI · programmatic generation · simulation assets · scene editing

The pith

SceneCode turns natural language prompts into executable Blender programs that produce editable indoor scenes with articulated objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that indoor scene synthesis can be reframed as the generation of executable world programs rather than collections of static meshes. It shows that routing object requests through code-generation strategies and an execution-guided repair loop yields assets that better match input prompts, exhibit cleaner geometry, and carry simulator-ready articulation data. A sympathetic reader would care because this removes reliance on pre-curated asset libraries and enables on-demand creation of physically interactable environments for robotics and embodied AI. The approach also keeps the entire scene traceable, so individual objects remain locally editable after initial generation.

Core claim

A room-level agentic backbone converts a prompt into a house layout and per-object AssetRequests; each request is routed to one of five code-generation strategies that output part-wise Blender Python programs; an execution-guided repair-and-refine loop validates the programs; the resulting programs compile into simulation-ready assets with articulation metadata and are linked through a persistent scene-state registry that supports traceable editing.

What carries the argument

The SceneCode pipeline, which routes each AssetRequest to a code-generation strategy and applies an execution-guided repair-and-refine loop to produce valid part-wise Blender programs for articulated objects.
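
The execution-guided repair-and-refine loop can be illustrated with a minimal sketch. It is not the paper's implementation: the real programs are Blender Python and the repair step is presumably a model call, whereas here the program is plain Python, and `run_program`, `repair_loop`, and `toy_repair` are hypothetical names introduced only for illustration.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_program(source: str) -> Optional[str]:
    """Execute a program string; return None on success, else the traceback text."""
    try:
        exec(compile(source, "<asset_program>", "exec"), {"__name__": "__asset__"})
        return None
    except Exception:
        return traceback.format_exc()


def repair_loop(
    source: str,
    repair: Callable[[str, str], str],
    max_iters: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate execution and repair until the program runs or the budget is spent.

    Returns (final_source, succeeded, number_of_repairs_applied).
    """
    for attempt in range(max_iters + 1):
        error = run_program(source)
        if error is None:
            return source, True, attempt
        if attempt < max_iters:
            source = repair(source, error)
    return source, False, max_iters


def toy_repair(source: str, error: str) -> str:
    # Hypothetical stand-in for a model-driven repair step:
    # it patches one known typo when the traceback reports a NameError.
    if "NameError" in error:
        return source.replace("widht", "width")
    return source


broken = "width = 0.5\ndepth = 0.4\narea = widht * depth\n"
fixed, ok, repairs = repair_loop(broken, toy_repair)
print(ok, repairs)
```

The error text from the failed execution is what guides the repair, and the returned repair count is the kind of per-object statistic that a reliability evaluation of the loop would aggregate.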

If this is right

  • Generated scenes match the original prompt more faithfully than static-mesh pipelines.
  • Produced assets exhibit cleaner mesh structure than library-sourced alternatives.
  • Assets carry simulator-loadable articulation metadata that supports direct use in physics engines.
  • Scene assembly becomes a traceable process in which individual objects can be edited locally without regenerating the entire environment.
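
The local-editing claim in the last bullet can be sketched as a registry that keeps each object's request, program, and compiled asset together, so that one object can be re-executed with new parameters while the others stay untouched. This is an assumption-laden illustration, not the paper's registry: `SceneRegistry`, `RegistryEntry`, and `nightstand_program` are hypothetical, and a plain Python function stands in for a Blender program.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class RegistryEntry:
    request: str                      # natural-language object request
    program: Callable[..., dict]      # stand-in for an executable object program
    params: Dict[str, Any]            # parameters the program was last run with
    asset: dict = field(default_factory=dict)  # stand-in for compiled geometry


class SceneRegistry:
    """Hypothetical registry linking requests, programs, and assets by object id."""

    def __init__(self) -> None:
        self.entries: Dict[str, RegistryEntry] = {}

    def register(self, object_id: str, request: str,
                 program: Callable[..., dict], **params: Any) -> RegistryEntry:
        entry = RegistryEntry(request, program, dict(params))
        entry.asset = program(**entry.params)
        self.entries[object_id] = entry
        return entry

    def edit(self, object_id: str, **changes: Any) -> RegistryEntry:
        # Re-execute only the edited object's program; other entries are not touched.
        entry = self.entries[object_id]
        entry.params.update(changes)
        entry.asset = entry.program(**entry.params)
        return entry


def nightstand_program(width: float = 0.45, height: float = 0.55, drawers: int = 1) -> dict:
    # Toy "program": returns a bounding box and a list of simulation links.
    links = ["body"] + [f"drawer_{i}" for i in range(drawers)]
    return {"bbox": (width, 0.4, height), "links": links}


registry = SceneRegistry()
registry.register("nightstand_0", "a nightstand with one drawer", nightstand_program)
registry.register("nightstand_1", "a wider nightstand", nightstand_program, width=0.5)
untouched = registry.entries["nightstand_1"].asset
registry.edit("nightstand_0", drawers=2)
print(registry.entries["nightstand_0"].asset["links"])
```

Because each entry stores the program together with its parameters, an edit is a parameter change plus one re-execution, which is what makes the change traceable.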

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-repair mechanism could be tested on prompts that introduce previously unseen object categories to measure generalization beyond the training distribution of indoor furniture.
  • Because programs remain human-readable, natural-language edits could be mapped back to targeted code changes, offering a path toward interactive scene refinement that the current evaluation does not yet demonstrate.
  • The persistent registry that links requests, programs, and assets may simplify integration with reinforcement-learning pipelines that require repeatable scene variations.

Load-bearing premise

The method will reliably produce valid Blender programs for arbitrary articulated indoor objects when each request is routed to one of five code-generation strategies followed by an execution-guided repair loop.

What would settle it

Run the system on a prompt such as 'a living room containing a sofa whose seat cushion lifts independently' and check whether the exported program loads into a physics simulator with correct joint articulation and without mesh errors.
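
Before loading an exported asset into a simulator, a structural pre-check of the SDF can catch some articulation errors. The sketch below uses only the standard library and checks that every joint references declared links and that at least one movable joint with an axis exists. The embedded nightstand SDF is a hand-written example, not output from SceneCode, and `check_articulation` is a hypothetical helper. It does not replace an actual simulator load.

```python
import xml.etree.ElementTree as ET

MOVABLE_TYPES = {"revolute", "prismatic"}

# Hand-written example asset (an assumption for illustration only).
NIGHTSTAND_SDF = """
<sdf version="1.7">
  <model name="nightstand">
    <link name="body"/>
    <link name="drawer"/>
    <joint name="drawer_slide" type="prismatic">
      <parent>body</parent>
      <child>drawer</child>
      <axis>
        <xyz>1 0 0</xyz>
        <limit><lower>0</lower><upper>0.3</upper></limit>
      </axis>
    </joint>
  </model>
</sdf>
""".strip()


def check_articulation(sdf_text: str) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    root = ET.fromstring(sdf_text)
    problems = []
    movable = 0
    for model in root.iter("model"):
        links = {link.get("name") for link in model.findall("link")}
        for joint in model.findall("joint"):
            name = joint.get("name")
            for role in ("parent", "child"):
                target = (joint.findtext(role) or "").strip()
                if target not in links and target != "world":
                    problems.append(f"joint {name}: {role} '{target}' is not a declared link")
            if joint.get("type") in MOVABLE_TYPES:
                movable += 1
                if joint.find("axis/xyz") is None:
                    problems.append(f"joint {name}: movable joint has no axis")
    if movable == 0:
        problems.append("no revolute or prismatic joint found")
    return problems


print(check_articulation(NIGHTSTAND_SDF))
```

A file that passes this check can still fail in simulation (for example through bad collision geometry or inertia), so the decisive test remains the simulator load and interaction described above.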

Figures

Figures reproduced from arXiv: 2605.19587 by Kevin Qinghong Lin, Linjie Li, Puyi Wang, Yangguang Li, Yu Cheng, Yuhao Wang, Zhengyuan Yang.

Figure 1: Overview of SCENECODE. Given a natural language scene prompt, our framework compiles it into an executable, code-driven indoor scene with interactable objects.
Figure 2: Overview of SCENECODE. From room-level planning to code-driven object generation, simulation-ready compilation, and scene-state registration.
3.1 Room-Level Agentic Scene Backbone. The room-level backbone transforms a scene prompt into a set of per-room object specifications that drive subsequent program synthesis. Concretely, it produces a structured house layout H together with an ordered sequence of obje…
Figure 3: Room-level qualitative comparison. SceneCode shows better prompt fidelity than the baselines. See Appendix I.1 for additional examples.
Figure 4: Object-level qualitative comparison. (a) Mesh and UV layouts of representative assets from SceneCode versus SAM 3D Objects. (b) Code-level editability: locally re-executing one object program with different parameters yields variant objects. (c) On-demand articulated objects with prescribed structure or material that retrieval-based pipelines cannot satisfy.
Figure 5: Robot interaction with generated articulated assets. A SceneCode-generated articulated object is imported into MuJoCo for contact-based robot manipulation. The movable parts produced by the Articulated Object Program remain independent links with compiled joints, allowing the robot to physically open or slide them.
Figure 6: Route-specific prompt ablation. Generic prompting can produce visually plausible assets, but route-specific construction prompts better preserve semantic parts, internal structure, material grounding, and articulation-ready components.
Here, generic prompting refers to removing the router and simply asking the VLM, given the input request, to generate Blender code for the desired 3D content. …
Figure 7: User-study interface. Screenshot of the UI used in our user study.
G Full User Study Results. Setup recap. The user study involves nine participants split evenly into three groups of three. Group A compares SceneCode against SceneSmith, Group B against HSM, and Group C against LayoutVLM, on the same 30 SceneEval-100 room-level prompts used for the automatic evaluation. For each prompt and each comparison, p…
Figure 8: Additional room-level demonstrations. SCENECODE generates kitchen, basement, dining room, living room, bathroom, and bedroom scenes with prompt-faithful object coverage, spatial layout, and stylistic attributes.
Figure 9: Additional object-level demonstrations. Rendered objects and corresponding wireframes for furniture, ceiling objects, wall objects, and manipulands generated by SCENECODE. The examples show that code-generated assets can express materials, numbers, text, and nontrivial geometric structure while preserving explicit mesh organization.
Figure 10: Articulated object demonstrations. SCENECODE generates articulated household objects with correct geometric structure, storage interiors such as shelves and partitions, and movable components that remain available for downstream interaction.
Figure 11: Generated nightstand demo.
This appendix provides a concrete example of the executable object representation used by SCENECODE. We use a generated nightstand as the running example. The object-level Blender program constructs the movable drawer from geometric primitives and procedural materials, while the exported SDF preserves the drawer as an independent simulation link with a prismatic joint. J.1 Bl…

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents SceneCode, a framework that compiles natural language prompts into executable Blender Python programs for generating editable indoor scenes with articulated objects. A room-level agentic backbone produces structured layouts and per-object AssetRequests via a planner-designer-critic loop. Each request routes to one of five code-generation strategies, yielding part-wise programs that undergo an execution-guided repair-and-refine loop. Programs compile to simulation-ready SDF assets with articulation metadata, supported by a persistent scene-state registry for traceability and local editability. Evaluations across scene synthesis, asset quality, human judgment, and robot interaction claim improvements in prompt faithfulness, mesh cleanliness, and simulator-loadable articulations.

Significance. If the empirical results hold, this work offers a meaningful advance for embodied AI and robotics by shifting from static mesh libraries to on-demand, programmatically editable and articulated scenes. The combination of agentic planning, code synthesis with self-correction, and a traceable registry addresses key limitations in controllability and reproducibility for simulation-based policy evaluation.

major comments (2)
  1. [§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.
  2. [§4] The tables and figures reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.
minor comments (3)
  1. [§3] Notation for AssetRequest and the persistent registry could be formalized with a small diagram or pseudocode to clarify data flow between planner, code generators, and simulator export.
  2. [Figures 3-5] Figure captions for rendered scenes and SDF assets should explicitly note which objects were synthesized versus retrieved to highlight the on-demand generation contribution.
  3. [Abstract] The abstract's summary of results would benefit from one or two concrete metric improvements (e.g., 'X% higher prompt faithfulness') to better orient readers before the detailed evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive recommendation for minor revision. We appreciate the constructive feedback on strengthening the empirical validation of our framework. Below, we address each major comment point by point, outlining the revisions we will make to the manuscript.

  1. Referee: [§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.

    Authors: We agree with the referee that quantitative evaluation of the repair-and-refine loop is essential to support the scalability of our approach. The manuscript currently focuses on the design and qualitative demonstration of the loop in §3.2. To address this, we will add a new subsection or appendix with results from additional experiments on a held-out test set of 100 complex articulated objects, including multi-joint furniture. This will report success rates (e.g., percentage of programs that execute without errors after repair), average number of iterations required, and categorized failure modes (such as syntax errors, geometry inconsistencies, or articulation mismatches). These metrics will substantiate the reliability of the five-strategy routing and self-correction mechanism. revision: yes

  2. Referee: [§4] The tables and figures reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.

    Authors: We acknowledge that the current presentation of results in §4 could be strengthened by more explicit comparisons and statistical rigor. While our evaluations include comparisons to prior methods for scene synthesis and asset quality, we will revise the tables and figures to clearly identify all baselines used, include statistical significance tests (e.g., paired t-tests with p-values), and add error analysis such as standard deviations, confidence intervals, and discussion of outlier cases. This will enable readers to fully assess the improvements claimed over existing pipelines. revision: yes
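
The two analyses promised in these responses can be sketched with the standard library alone: a summary of repair-loop outcomes (success rate, mean iterations, failure modes) and a paired t statistic for per-prompt score comparisons. All records and scores below are made-up placeholders, not results from the paper, and `summarize_repairs` and `paired_t` are hypothetical helper names. Turning the t statistic into a p-value requires a Student's t distribution, which is left to a statistics package.

```python
import math
from collections import Counter
from statistics import mean, stdev


def summarize_repairs(records):
    """Aggregate repair-loop logs into success rate, mean iterations, and failure modes."""
    successes = [r for r in records if r["success"]]
    return {
        "success_rate": len(successes) / len(records),
        "mean_iterations": mean(r["iterations"] for r in successes) if successes else float("nan"),
        "failure_modes": Counter(r["failure_mode"] for r in records if not r["success"]),
    }


def paired_t(ours, baseline):
    """Paired t statistic and degrees of freedom for per-prompt score pairs."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


# Placeholder logs: one record per generated object (hypothetical values).
records = [
    {"success": True, "iterations": 0, "failure_mode": None},
    {"success": True, "iterations": 2, "failure_mode": None},
    {"success": True, "iterations": 1, "failure_mode": None},
    {"success": False, "iterations": 3, "failure_mode": "articulation mismatch"},
]
summary = summarize_repairs(records)

# Placeholder per-prompt scores for two systems on the same prompts.
t_stat, dof = paired_t([3.0, 5.0, 7.0], [2.0, 3.0, 4.0])
print(summary["success_rate"], round(t_stat, 4), dof)
```

The pairing matters because both systems are scored on the same prompts, so the test is applied to per-prompt differences rather than to two independent samples.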

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes a new pipeline for programmatic indoor scene synthesis that routes AssetRequests through five code-generation strategies and an execution-guided repair loop to produce Blender Python programs, which are then compiled to SDF assets. This construction relies on standard agentic planning, self-correction loops, and external tools (Blender, SDF export) rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. Evaluations on scene synthesis, asset quality, and robot interaction are presented as independent benchmarks, leaving the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review is based solely on the abstract; detailed free parameters, axioms, or invented entities cannot be fully audited without the full manuscript. The framework appears to rely on standard assumptions about agent planning and code executability.

axioms (2)
  • domain assumption: Natural language prompts can be reliably turned into structured house layouts and per-object AssetRequests via a planner-designer-critic loop.
    Invoked in the room-level agentic backbone description.
  • domain assumption: Blender Python programs generated by the five strategies can be executed and repaired to produce valid articulated assets.
    Central to the code-generation and validation loop.
invented entities (2)
  • AssetRequest (no independent evidence)
    purpose: Structured request for per-object code generation.
    Introduced to route object needs through the pipeline.
  • persistent scene-state registry (no independent evidence)
    purpose: Links requests, programs, geometry, and simulation assets for traceability and editing.
    New component enabling editable world-building.

pith-pipeline@v0.9.0 · 5823 in / 1566 out tokens · 53902 ms · 2026-05-20T05:59:11.231841+00:00 · methodology


