SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Kevin Qinghong Lin; Linjie Li; Puyi Wang; Yangguang Li; Yu Cheng; Yuhao Wang; Zhengyuan Yang

arxiv: 2605.19587 · v1 · pith:3I2BWIGJnew · submitted 2026-05-19 · 💻 cs.AI

Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng

Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords: indoor scene synthesis · executable programs · articulated objects · Blender Python · embodied AI · programmatic generation · simulation assets · scene editing

The pith

SceneCode turns natural language prompts into executable Blender programs that produce editable indoor scenes with articulated objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that indoor scene synthesis can be reframed as the generation of executable world programs rather than collections of static meshes. It shows that routing object requests through code-generation strategies and an execution-guided repair loop yields assets that better match input prompts, exhibit cleaner geometry, and carry simulator-ready articulation data. A sympathetic reader would care because this removes reliance on pre-curated asset libraries and enables on-demand creation of physically interactable environments for robotics and embodied AI. The approach also keeps the entire scene traceable, so individual objects remain locally editable after initial generation.

Core claim

A room-level agentic backbone converts a prompt into a house layout and per-object AssetRequests; each request is routed to one of five code-generation strategies that output part-wise Blender Python programs; an execution-guided repair-and-refine loop validates the programs; the resulting programs compile into simulation-ready assets with articulation metadata and are linked through a persistent scene-state registry that supports traceable editing.
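The routing step in this claim can be pictured with a small sketch. Everything in it is hypothetical: the `AssetRequest` fields, the two stand-in strategies, and the rule inside `route` are illustrative assumptions, since the summary states only that five strategies exist, not what they are or how a request is assigned to one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AssetRequest:
    """Hypothetical per-object request; the field names are assumptions."""
    object_id: str
    description: str
    movable_parts: List[str] = field(default_factory=list)


# A strategy maps a request to the source text of an object program.
Strategy = Callable[[AssetRequest], str]


def rigid_strategy(req: AssetRequest) -> str:
    return f"# program for rigid object {req.object_id}\nparts = ['body']\n"


def articulated_strategy(req: AssetRequest) -> str:
    parts = ["body"] + req.movable_parts
    return f"# program for articulated object {req.object_id}\nparts = {parts!r}\n"


# Placeholder table: the paper routes among five strategies; two stand-ins here.
STRATEGIES: Dict[str, Strategy] = {
    "rigid": rigid_strategy,
    "articulated": articulated_strategy,
}


def route(req: AssetRequest) -> str:
    """Toy routing rule: any movable part selects the articulated strategy."""
    return "articulated" if req.movable_parts else "rigid"


def generate_program(req: AssetRequest) -> str:
    return STRATEGIES[route(req)](req)


requests = [
    AssetRequest("nightstand_01", "nightstand with one drawer", ["drawer"]),
    AssetRequest("rug_01", "flat woven rug"),
]
programs = {r.object_id: generate_program(r) for r in requests}
```

Keeping the strategies in a lookup table means a new strategy can be added without touching the code that calls the router, which is one plausible reason to separate routing from generation.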

What carries the argument

The SceneCode pipeline, which routes each AssetRequest to a code-generation strategy and applies an execution-guided repair-and-refine loop to produce valid part-wise Blender programs for articulated objects.
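A minimal sketch of the repair half of such a loop follows, assuming that validation means running the program and passing any error text to a fixer. The names `execute`, `repair_loop`, and `toy_fix` are hypothetical. In the paper the fixer would presumably be a model call and the program would run inside Blender; here a plain Python snippet and a hard-coded fix stand in for both, and the refine stage is not modeled.

```python
import traceback
from typing import Callable, Optional, Tuple


def execute(program: str) -> Optional[str]:
    """Run the program in a fresh namespace; return the error text, or None."""
    try:
        exec(compile(program, "<object_program>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def repair_loop(
    program: str,
    fix: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Execute, feed any error back to `fix`, and retry up to max_rounds times.

    Returns (final_program, succeeded, repair_rounds_used).
    """
    for rounds_used in range(max_rounds + 1):
        error = execute(program)
        if error is None:
            return program, True, rounds_used
        if rounds_used < max_rounds:
            program = fix(program, error)
    return program, False, max_rounds


def toy_fix(program: str, error: str) -> str:
    """Stand-in for a model call: defines a name the program forgot."""
    if "NameError" in error:
        return "width = 0.4\n" + program
    return program


broken = "drawer_size = (width, 0.3, 0.1)\n"
fixed, ok, rounds = repair_loop(broken, toy_fix)
```

The loop returns the number of repair rounds it used, which is the kind of per-object statistic a reader would need in order to judge how often the loop converges.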

If this is right

Generated scenes match the original prompt more faithfully than static-mesh pipelines.
Produced assets exhibit cleaner mesh structure than library-sourced alternatives.
Assets carry simulator-loadable articulation metadata that supports direct use in physics engines.
Scene assembly becomes a traceable process in which individual objects can be edited locally without regenerating the entire environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same routing-plus-repair mechanism could be tested on prompts that introduce previously unseen object categories to measure generalization beyond the training distribution of indoor furniture.
Because programs remain human-readable, natural-language edits could be mapped back to targeted code changes, offering a path toward interactive scene refinement that the current evaluation does not yet demonstrate.
The persistent registry that links requests, programs, and assets may simplify integration with reinforcement-learning pipelines that require repeatable scene variations.

Load-bearing premise

The method will reliably produce valid Blender programs for arbitrary articulated indoor objects when each request is routed to one of five code-generation strategies followed by an execution-guided repair loop.

What would settle it

Run the system on a prompt such as 'a living room containing a sofa whose seat cushion lifts independently' and check whether the exported program loads into a physics simulator with correct joint articulation and without mesh errors.
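Part of this check can be automated before any simulator is involved. The sketch below is a static sanity check on an SDF file, assuming the usual SDFormat layout of `<model>`, `<link>`, and `<joint>` elements. The sample document is hand-written for illustration and is not output from SceneCode, and `check_joints` is a hypothetical helper. Passing it does not show that the asset behaves correctly under physics.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample in SDFormat style; not output from the paper's system.
SAMPLE_SDF = """
<sdf version="1.7">
  <model name="nightstand">
    <link name="body"/>
    <link name="drawer"/>
    <joint name="drawer_slide" type="prismatic">
      <parent>body</parent>
      <child>drawer</child>
      <axis>
        <xyz>1 0 0</xyz>
        <limit><lower>0.0</lower><upper>0.25</upper></limit>
      </axis>
    </joint>
  </model>
</sdf>
"""


def check_joints(sdf_text: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    model = ET.fromstring(sdf_text).find("model")
    if model is None:
        return ["no <model> element"]
    links = {link.get("name") for link in model.findall("link")}
    joints = model.findall("joint")
    if not joints:
        problems.append("model has no joints")
    for joint in joints:
        name = joint.get("name")
        for role in ("parent", "child"):
            target = joint.findtext(role)
            # SDF allows "world" as a joint parent, so accept it for that role.
            allowed = links | {"world"} if role == "parent" else links
            if target not in allowed:
                problems.append(f"{name}: {role} '{target}' is not a link")
        if joint.get("type") in ("prismatic", "revolute"):
            lower = joint.findtext("axis/limit/lower")
            upper = joint.findtext("axis/limit/upper")
            if lower is None or upper is None:
                problems.append(f"{name}: missing axis limits")
            elif float(lower) >= float(upper):
                problems.append(f"{name}: lower limit is not below upper limit")
    return problems


problems = check_joints(SAMPLE_SDF)
```

A check like this catches dangling link references and inverted limits cheaply; the decisive test is still loading the asset into a physics engine and actuating the joint.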

Figures

Figures reproduced from arXiv: 2605.19587 by Kevin Qinghong Lin, Linjie Li, Puyi Wang, Yangguang Li, Yu Cheng, Yuhao Wang, Zhengyuan Yang.

**Figure 1.** Overview of SCENECODE. Given a natural language scene prompt, our framework compiles it into an executable, code-driven indoor scene with interactable objects.

**Figure 2.** Overview of SCENECODE. From room-level planning to code-driven object generation, simulation-ready compilation, and scene-state registration.
Text following the caption (3.1 Room-Level Agentic Scene Backbone): The room-level backbone transforms a scene prompt into a set of per-room object specifications that drive subsequent program synthesis. Concretely, it produces a structured house layout H together with an ordered sequence of obje…

**Figure 3.** Room-level qualitative comparison. SceneCode shows better prompt fidelity than the baselines. See Appendix I.1 for additional examples.

**Figure 4.** Object-level qualitative comparison. (a) Mesh and UV layouts of representative assets from SceneCode versus SAM 3D Objects. (b) Code-level editability: locally re-executing one object program with different parameters yields variant objects. (c) On-demand articulated objects with prescribed structure or material that retrieval-based pipelines cannot satisfy.
Text following the caption (4.3 Object-Level Geometry and UV Quality): We comp…

**Figure 5.** Robot interaction with generated articulated assets. A SceneCode-generated articulated object is imported into MuJoCo for contact-based robot manipulation. The movable parts produced by the Articulated Object Program remain independent links with compiled joints, allowing the robot to physically open or slide them.

**Figure 6.** Route-specific prompt ablation. Generic prompting can produce visually plausible assets, but route-specific construction prompts better preserve semantic parts, internal structure, material grounding, and articulation-ready components.
Text following the caption: Here, generic prompting refers to removing the router and simply asking the VLM, given the input request, to generate Blender code for the desired 3D content. As shown in […

**Figure 7.** User-study interface. Screenshot of the UI used in our user study.
Text following the caption (G Full User Study Results, setup recap): The user study involves nine participants split evenly into three groups of three. Group A compares SceneCode against SceneSmith, Group B against HSM, and Group C against LayoutVLM, on the same 30 SceneEval-100 room-level prompts used for the automatic evaluation. For each prompt and each comparison, p…

**Figure 8.** Additional room-level demonstrations. SCENECODE generates kitchen, basement, dining room, living room, bathroom, and bedroom scenes with prompt-faithful object coverage, spatial layout, and stylistic attributes.

**Figure 9.** Additional object-level demonstrations. Rendered objects and corresponding wireframes for furniture, ceiling objects, wall objects, and manipulands generated by SCENECODE. The examples show that code-generated assets can express materials, numbers, text, and nontrivial geometric structure while preserving explicit mesh organization.

**Figure 10.** Articulated object demonstrations. SCENECODE generates articulated household objects with correct geometric structure, storage interiors such as shelves and partitions, and movable components that remain available for downstream interaction.

**Figure 11.** Generated nightstand demo.
Text following the caption: This appendix provides a concrete example of the executable object representation used by SCENECODE. We use a generated nightstand as the running example. The object-level Blender program constructs the movable drawer from geometric primitives and procedural materials, while the exported SDF preserves the drawer as an independent simulation link with a prismatic joint. J.1 Bl…

Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SceneCode generates indoor scenes as editable Blender code with articulation instead of static meshes. That fits the needs of robotics simulation, but the abstract alone makes the actual gains hard to judge.

The main point is that this paper treats scene synthesis as writing and compiling executable programs rather than dumping out meshes. The authors start with a prompt, run an agentic planner-designer-critic loop to produce a layout and per-object AssetRequests, then route each request to one of five code strategies that output part-wise Blender Python scripts. An execution-guided repair loop fixes problems by actually running the code, the scripts compile to SDF assets with articulation metadata, and a registry keeps everything linked for later edits.

That setup directly targets the controllability gap for doors, drawers, and other movable parts in embodied AI work. It builds sensibly on Blender and current agent patterns, and the registry idea adds practical traceability without overcomplicating things. The claim that this produces prompt-faithful scenes with cleaner geometry and simulator-ready metadata is the central advance, and if the experiments back it, the programmatic angle is useful.

The soft spot is the evaluation. The abstract reports positive results on scene quality, asset structure, human ratings, and robot tasks, yet gives no numbers, baselines, or failure cases. That makes it difficult to gauge whether the repair loop actually scales or whether the five strategies deliver consistent wins over simpler mesh pipelines. The key assumption, that routing plus self-correction will reliably yield valid articulated programs for new objects, needs concrete evidence from the full results to land.

This is aimed at people building simulation environments for robotics and policy learning. Readers working on code-driven 3D generation or editable assets would find the pipeline description worth their time. The thinking is coherent and the problem framing is honest, so the paper deserves a serious referee even if the numbers section needs tightening.

Referee Report

2 major / 3 minor

Summary. The paper presents SceneCode, a framework that compiles natural language prompts into executable Blender Python programs for generating editable indoor scenes with articulated objects. A room-level agentic backbone produces structured layouts and per-object AssetRequests via a planner-designer-critic loop. Each request routes to one of five code-generation strategies, yielding part-wise programs that undergo an execution-guided repair-and-refine loop. Programs compile to simulation-ready SDF assets with articulation metadata, supported by a persistent scene-state registry for traceability and local editability. Evaluations across scene synthesis, asset quality, human judgment, and robot interaction claim improvements in prompt faithfulness, mesh cleanliness, and simulator-loadable articulations.

Significance. If the empirical results hold, this work offers a meaningful advance for embodied AI and robotics by shifting from static mesh libraries to on-demand, programmatically editable and articulated scenes. The combination of agentic planning, code synthesis with self-correction, and a traceable registry addresses key limitations in controllability and reproducibility for simulation-based policy evaluation.

major comments (2)

[§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.
[§4] Table or figure reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.

minor comments (3)

[§3] Notation for AssetRequest and the persistent registry could be formalized with a small diagram or pseudocode to clarify data flow between planner, code generators, and simulator export.
[Figures 3-5] Figure captions for rendered scenes and SDF assets should explicitly note which objects were synthesized versus retrieved to highlight the on-demand generation contribution.
[Abstract] The abstract's summary of results would benefit from one or two concrete metric improvements (e.g., 'X% higher prompt faithfulness') to better orient readers before the detailed evaluation.
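As an illustration of the kind of pseudocode the first minor comment asks for, here is one way the registry's data flow could be written down. This is a sketch under stated assumptions, not the authors' design: `RegistryEntry`, `SceneRegistry`, every field name, and the `assets/<id>.sdf` path scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RegistryEntry:
    """One object's linked records; all field names are assumptions."""
    request: str          # text of the per-object request
    program: str          # source of the object program
    sdf_path: str         # where the compiled simulation asset would live
    revision: int = 0


class SceneRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}
        self.log: List[str] = []   # append-only trace of what changed

    def register(self, object_id: str, request: str, program: str) -> None:
        self._entries[object_id] = RegistryEntry(
            request, program, f"assets/{object_id}.sdf"
        )
        self.log.append(f"register {object_id}")

    def edit(self, object_id: str, transform: Callable[[str], str]) -> None:
        """Rewrite one object's program; other entries are left untouched."""
        entry = self._entries[object_id]
        entry.program = transform(entry.program)
        entry.revision += 1
        self.log.append(f"edit {object_id} -> rev {entry.revision}")

    def get(self, object_id: str) -> RegistryEntry:
        return self._entries[object_id]


registry = SceneRegistry()
registry.register("nightstand_01", "nightstand with one drawer", "drawer_depth = 0.30\n")
registry.register("lamp_01", "small table lamp", "shade_radius = 0.12\n")
registry.edit("nightstand_01", lambda src: src.replace("0.30", "0.45"))
```

The edit touches only the nightstand entry and records a log line, which is the sense in which an edit would be both local and traceable.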

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive recommendation for minor revision. We appreciate the constructive feedback on strengthening the empirical validation of our framework. Below, we address each major comment point by point, outlining the revisions we will make to the manuscript.

Referee: [§3.2] The central claim that the five-strategy routing plus execution-guided repair-and-refine loop reliably produces valid part-wise Blender programs for arbitrary articulated objects is load-bearing; the manuscript should report success/failure rates, iteration counts, and failure modes of the repair loop on a held-out set of complex objects (e.g., multi-joint furniture) to substantiate scalability.

Authors: We agree with the referee that quantitative evaluation of the repair-and-refine loop is essential to support the scalability of our approach. The manuscript currently focuses on the design and qualitative demonstration of the loop in §3.2. To address this, we will add a new subsection or appendix with results from additional experiments on a held-out test set of 100 complex articulated objects, including multi-joint furniture. This will report success rates (e.g., percentage of programs that execute without errors after repair), average number of iterations required, and categorized failure modes (such as syntax errors, geometry inconsistencies, or articulation mismatches). These metrics will substantiate the reliability of the five-strategy routing and self-correction mechanism. revision: yes
Referee: [§4] Table or figure reporting quantitative results for scene synthesis and asset quality (e.g., prompt alignment scores, mesh quality metrics, articulation validity rates) must include explicit baselines, statistical significance, and error analysis; without these, the stated improvements over prior pipelines cannot be fully assessed.

Authors: We acknowledge that the current presentation of results in §4 could be strengthened by more explicit comparisons and statistical rigor. While our evaluations include comparisons to prior methods for scene synthesis and asset quality, we will revise the tables and figures to clearly identify all baselines used, include statistical significance tests (e.g., paired t-tests with p-values), and add error analysis such as standard deviations, confidence intervals, and discussion of outlier cases. This will enable readers to fully assess the improvements claimed over existing pipelines. revision: yes
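The statistics promised in these two responses are simple to compute once per-object records exist. The sketch below uses made-up records and scores purely for illustration; `RepairRecord`, `summarize`, and `paired_t` are hypothetical names, and no number here comes from the paper. It reports the paired t statistic only, because the Python standard library has no t-distribution CDF; a p-value would need a statistics package such as SciPy (`scipy.stats.ttest_rel`).

```python
import math
from collections import Counter
from statistics import mean, stdev
from typing import List, NamedTuple, Optional


class RepairRecord(NamedTuple):
    succeeded: bool
    iterations: int
    failure_mode: Optional[str]  # None when the program succeeded


def summarize(records: List[RepairRecord]) -> dict:
    """Success rate, mean iterations over successes, and failure-mode counts."""
    wins = [r for r in records if r.succeeded]
    return {
        "success_rate": len(wins) / len(records),
        "mean_iterations": mean(r.iterations for r in wins) if wins else None,
        "failure_modes": Counter(r.failure_mode for r in records if not r.succeeded),
    }


def paired_t(a: List[float], b: List[float]) -> float:
    """Paired t statistic for per-prompt scores a and b (n - 1 degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Made-up records, for illustration only.
records = [
    RepairRecord(True, 0, None),
    RepairRecord(True, 2, None),
    RepairRecord(False, 3, "articulation mismatch"),
    RepairRecord(True, 1, None),
]
summary = summarize(records)
t_stat = paired_t([4.0, 5.0, 6.0, 7.0], [3.0, 3.0, 5.0, 5.0])
```

A paired test fits this setting because each system is scored on the same prompts, so per-prompt differences are the natural unit of comparison.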

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes a new pipeline for programmatic indoor scene synthesis that routes AssetRequests through five code-generation strategies and an execution-guided repair loop to produce Blender Python programs, which are then compiled to SDF assets. This construction relies on standard agentic planning, self-correction loops, and external tools (Blender, SDF export) rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. Evaluations on scene synthesis, asset quality, and robot interaction are presented as independent benchmarks, leaving the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review is based solely on the abstract; detailed free parameters, axioms, or invented entities cannot be fully audited without the full manuscript. The framework appears to rely on standard assumptions about agent planning and code executability.

axioms (2)

domain assumption: Natural language prompts can be reliably turned into structured house layouts and per-object AssetRequests via a planner-designer-critic loop.
Invoked in the room-level agentic backbone description.
domain assumption: Blender Python programs generated by the five strategies can be executed and repaired to produce valid articulated assets.
Central to the code-generation and validation loop.

invented entities (2)

AssetRequest (no independent evidence)
purpose: Structured request for per-object code generation.
Introduced to route object needs through the pipeline.
persistent scene-state registry (no independent evidence)
purpose: Links requests, programs, geometry, and simulation assets for traceability and editing.
New component enabling editable world-building.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 6 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Urdformer: A pipeline for constructing articulated simula- tion environments from real-world images.arXiv preprint arXiv:2405.11656, 2024

Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, and Abhishek Gupta. Urdformer: A pipeline for constructing articulated simulation environments from real-world images.arXiv preprint arXiv:2405.11656, 2024

work page arXiv 2024
[3]

2025.doi:10.48550/arXiv.2508.14879

Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, et al. Meshcoder: Llm-powered structured mesh code generation from point clouds.arXiv preprint arXiv:2508.14879, 2025

work page arXiv 2025
[4]

Procthor: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems, 35:5982–5994, 2022

work page 2022
[5]

Blenderllm: Training large language models for computer-aided design with self- improvement.arXiv preprint arXiv:2412.14203, 2024

Yuhao Du, Shunian Chen, Wenbo Zan, Peizhao Li, Mingxuan Wang, Dingjie Song, Bo Li, Yan Hu, and Benyou Wang. Blenderllm: Training large language models for computer-aided design with self- improvement.arXiv preprint arXiv:2412.14203, 2024

work page arXiv 2024
[6]

Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems, 36:18225–18250, 2023

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems, 36:18225–18250, 2023

work page 2023
[7]

Anyhome: Open-vocabulary generation of structured and textured 3d homes

Rao Fu, Zehao Wen, Zichen Liu, and Srinath Sridhar. Anyhome: Open-vocabulary generation of structured and textured 3d homes. InEuropean Conference on Computer Vision, pages 52–70. Springer, 2024

work page 2024
[8]

Scenecraft: An llm agent for synthesizing 3d scenes as blender code

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A Ross, Cordelia Schmid, and Alireza Fathi. Scenecraft: An llm agent for synthesizing 3d scenes as blender code. InForty-first International Conference on Machine Learning, 2024

work page 2024
[9]

Shapeassembly: Learning to generate programs for 3d shape structure synthesis.ACM Transactions on Graphics (TOG), 39(6):1–20, 2020

R Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J Mitra, and Daniel Ritchie. Shapeassembly: Learning to generate programs for 3d shape structure synthesis.ACM Transactions on Graphics (TOG), 39(6):1–20, 2020

work page 2020
[10]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022. URLhttps://arxiv.org/abs/1712.05474

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Nap: Neural 3d articulated object prior.Advances in Neural Information Processing Systems, 36:31878–31894, 2023

Jiahui Lei, Congyue Deng, William B Shen, Leonidas J Guibas, and Kostas Daniilidis. Nap: Neural 3d articulated object prior.Advances in Neural Information Processing Systems, 36:31878–31894, 2023

work page 2023
[12]

Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

work page 2023
[13]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

work page 2023
[14]

Instructscene: Instruction- driven 3d indoor scene synthesis with semantic graph prior

Chenguo Lin and Yadong Mu. Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior.arXiv preprint arXiv:2402.04717, 2024

work page arXiv 2024
[15]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[16]

Cage: Controllable articulation generation

Jiayi Liu, Hou In Ivan Tam, Ali Mahdavi-Amiri, and Manolis Savva. Cage: Controllable articulation generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17880–17889, 2024

work page 2024
[17]

Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding

Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019. 10

work page 2019
[18]

Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,

Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations.arXiv preprint arXiv:2107.14483, 2021

work page arXiv 2021
[19]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

GPT-4V(ision) system card.OpenAI Technical Report, 2023

OpenAI. GPT-4V(ision) system card.OpenAI Technical Report, 2023

work page 2023
[21]

Atiss: Autoregressive transformers for indoor scene synthesis.Advances in neural information processing systems, 34:12013–12026, 2021

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. Atiss: Autoregressive transformers for indoor scene synthesis.Advances in neural information processing systems, 34:12013–12026, 2021

work page 2021
[22]

arXiv preprint arXiv:2602.09153 , year=

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, and Russ Tedrake. Scenesmith: Agentic generation of simulation-ready indoor scenes.arXiv preprint arXiv:2602.09153, 2026

work page arXiv 2026
[23]

arXiv preprint arXiv:2503.16848 , year=

Hou In Derek Pun, Hou In Ivan Tam, Austin T Wang, Xiaoliang Huo, Angel X Chang, and Manolis Savva. Hsm: Hierarchical scene motifs for multi-scale indoor scene generation.arXiv preprint arXiv:2503.16848, 2025

work page arXiv 2025
[24]

Infinite photorealistic worlds using procedural generation

Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, Alejandro Newell, Hei Law, Ankit Goyal, Kaiyu Yang, and Jia Deng. Infinite photorealistic worlds using procedural generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages...

work page 2023
[25]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

work page 2019
[26]

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

3d-gpt: Procedural 3d modeling with large language models

Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, and Stephen Gould. 3d-gpt: Procedural 3d modeling with large language models. In2025 International Conference on 3D Vision (3DV), pages 1253–1263. IEEE, 2025

work page 2025
[28]

Layoutvlm: Differentiable optimization of 3d layout via vision-language models

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025

work page 2025
[29]

Habitat 2.0: Training home assistants to rearrange their habitat.Advances in neural information processing systems, 34:251–266, 2021

Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat.Advances in neural information processing systems, 34:251–266, 2021

work page 2021
[30]

Sceneeval: Evaluating semantic coherence in text-conditioned 3d indoor scene synthesis

Hou In Ivan Tam, Hou In Derek Pun, Austin T Wang, Angel X Chang, and Manolis Savva. Sceneeval: Evaluating semantic coherence in text-conditioned 3d indoor scene synthesis. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7355–7365, 2026

work page 2026
[31]

Diffuscene: Denoising diffusion models for generative indoor scene synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. Diffuscene: Denoising diffusion models for generative indoor scene synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20507–20518, 2024

[32]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[33]

SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images. 2025. URL h...

[34]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109

[36]

Sceneformer: Indoor scene generation with transformers

Xinpeng Wang, Chandan Yeshwanth, and Matthias Nießner. Sceneformer: Indoor scene generation with transformers. In 2021 International conference on 3D vision (3DV), pages 106–115. IEEE, 2021

[37]

Adaafford: Learning to adapt manipulation affordance for 3d articulated objects via few-shot interactions

Yian Wang, Ruihai Wu, Kaichun Mo, Jiaqi Ke, Qingnan Fan, Leonidas J Guibas, and Hao Dong. Adaafford: Learning to adapt manipulation affordance for 3d articulated objects via few-shot interactions. In European conference on computer vision, pages 90–107. Springer, 2022

[38]

Architect: Generating vivid and interactive 3d scenes with hierarchical 2d inpainting

Yian Wang, Xiaowen Qiu, Jiageng Liu, Zhehuan Chen, Jiting Cai, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, and Chuang Gan. Architect: Generating vivid and interactive 3d scenes with hierarchical 2d inpainting. Advances in Neural Information Processing Systems, 37:67575–67603, 2024

[39]

Sapien: A simulated part-based interactive environment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

[40]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

