ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Iro Armeni; Martin JJ. Bucher

arxiv: 2506.02459 · v6 · pith:BOHQYEEUnew · submitted 2025-06-03 · 💻 cs.CV

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Martin JJ. Bucher , Iro Armeni This is my paper

classification 💻 cs.CV

keywords sceneeditingobjectsynthesisadditionindoorlanguagesemantics

0 comments

read the original abstract

Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scene generation either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language, but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for autoregressive text-driven 3D indoor scene synthesis and editing. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene manipulation as a next-token prediction task, supporting object addition, removal, and swapping via natural language. We employ supervised fine-tuning with a preference alignment stage to train a specialized language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. We further introduce a voxelization-based evaluation metric capturing fine-grained geometric violations beyond 3D bounding boxes. Experiments surpass state-of-the-art on object addition and achieve superior human-perceived quality on the application of full scene synthesis, despite not being trained on it.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents
cs.CV 2026-06 unverdicted novelty 7.0

HDSL is a tree-structured DSL for 3D indoor scenes that lets LLM agents generate subtrees recursively and perform localized edits via hierarchical retrieval and deterministic merge.
SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation
cs.AI 2026-04 unverdicted novelty 7.0

SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes
cs.CV 2026-06 unverdicted novelty 6.0

A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.
Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments
cs.AI 2026-07 unverdicted novelty 3.0

SPG-Layout combines statistical object priors with hierarchical large-object-first placement to produce physically plausible text-driven 3D scenes in non-Manhattan rooms and outperforms baselines on a new 500-scene benchmark.