SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Nicholas Pfaff; Rick Cory; Russ Tedrake; Sergey Zakharov; Thomas Cohn

arxiv: 2602.09153 · v2 · pith:Z4YAZHCUnew · submitted 2026-02-09 · 💻 cs.RO · cs.AI · cs.CV · cs.GR

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

classification 💻 cs.RO cs.AI cs.CV cs.GR

keywords objects, scenesmith, environments, indoor, physical, agentic, articulated, average

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (from architectural layout to furniture placement to small object population), each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
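
The staged designer/critic/orchestrator interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify an API, so every name here (`Scene`, `orchestrate`, `stub_designer`, `stub_critic`, `max_rounds`) is hypothetical, and the VLM agents are replaced by plain Python stubs. It only shows one plausible control flow, in which an orchestrator walks the three stages in order and loops designer proposals through a critic until one is accepted or a round budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Stage names are taken from the abstract; everything else is an assumption.
STAGES = ["architectural layout", "furniture placement", "small object population"]


@dataclass
class Scene:
    """Hypothetical container: the prompt plus one kept proposal per stage."""
    prompt: str
    history: List[Tuple[str, str]] = field(default_factory=list)


# (scene, stage, feedback) -> proposal
Designer = Callable[[Scene, str, str], str]
# (scene, stage, proposal) -> (accepted, feedback)
Critic = Callable[[Scene, str, str], Tuple[bool, str]]


def orchestrate(prompt: str, designer: Designer, critic: Critic,
                max_rounds: int = 3) -> Scene:
    """Run each stage as a designer/critic loop, coarse to fine."""
    scene = Scene(prompt)
    for stage in STAGES:
        feedback = ""
        proposal = ""
        for _ in range(max_rounds):
            proposal = designer(scene, stage, feedback)
            accepted, feedback = critic(scene, stage, proposal)
            if accepted:
                break
        # Keep the last proposal even if the round budget was exhausted.
        scene.history.append((stage, proposal))
    return scene


def stub_designer(scene: Scene, stage: str, feedback: str) -> str:
    """Stand-in for a VLM designer: revises whenever feedback is non-empty."""
    suffix = " (revised)" if feedback else ""
    return f"{stage} for '{scene.prompt}'{suffix}"


def stub_critic(scene: Scene, stage: str, proposal: str) -> Tuple[bool, str]:
    """Stand-in for a VLM critic: rejects the first draft, accepts a revision."""
    if proposal.endswith("(revised)"):
        return True, ""
    return False, "please revise"


scene = orchestrate("a cluttered kitchen", stub_designer, stub_critic)
for stage, proposal in scene.history:
    print(f"{stage}: {proposal}")
```

In a real system the stubs would presumably be replaced by calls to VLM agents, and each accepted proposal would be a structured scene edit rather than a string; both are assumptions beyond what the abstract states.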

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI · 2026-05 · unverdicted · novelty 8.0
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI · 2026-05 · accept · novelty 8.0
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
cs.AI · 2026-05 · unverdicted · novelty 7.0
SceneCode compiles natural language prompts into executable code programs that generate editable, articulated indoor scenes for physics simulation.

3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO · 2026-04 · accept · novelty 7.0
3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
cs.CV · 2026-04 · unverdicted · novelty 7.0
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement
cs.GR · 2026-05 · unverdicted · novelty 6.0
VoxScene is a new anchor-conditioned voxel diffusion model that synthesizes collision-free 3D indoor scene arrangements via discrete volumetric occupancies and uses the grids for asset retrieval.

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis
cs.CV · 2026-05 · unverdicted · novelty 5.0
Code-as-Room is an MLLM-based agentic pipeline that parses top-down images into multi-stage Blender code synthesis with cross-stage memory to generate functional 3D rooms.

3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO · 2026-04 · unverdicted · novelty 3.0
The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO · 2026-04 · unverdicted · novelty 2.0
The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...