Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

Hangjun Ye; Jun Liu; Kai Chen; Kun Wang; Li Zhang; Long Chen; Xianhui Meng; Xiaoshuai Hao; Xiuying Chen; Yan Luo

arxiv: 2607.02407 · v1 · pith:F3BZ2O6Jnew · submitted 2026-07-02 · 💻 cs.AI · cs.CV

Xianhui Meng, Zirui Song, Yuchen Zhang, Li Zhang, Yongxuan Lv, Xiuying Chen, Kun Wang, Yan Luo, Kai Chen, Hangjun Ye, Long Chen, Jun Liu, Xiaoshuai Hao

Pith reviewed 2026-07-03 13:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords 3D scene synthesis · text-driven generation · non-Manhattan environments · indoor scenes · layout optimization · statistical priors · physical plausibility

The pith

SPG-Layout generates physically plausible 3D indoor scenes from text in non-Manhattan environments by combining statistical priors with hierarchical large-object placement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the failure of existing text-to-3D methods when rooms have non-right angles and irregular walls. It introduces SPG-Layout, which trains with statistical priors on typical object placements and then arranges furniture starting with the largest items before adding smaller ones. This two-part approach is meant to cut down on geometric clashes while keeping the scenes semantically believable. The authors built a fresh test set of 500 varied non-Manhattan rooms to measure the gains. Experiments show the method beats prior work on both angled and standard rectangular layouts.

Core claim

SPG-Layout is a text-driven framework with two parts. It first uses statistical priors of object distributions to guide training for stronger environmental understanding, then applies a hierarchical layout strategy that places large objects first to minimize layout violations. The authors claim this balances semantic realism with physical plausibility and outperforms existing methods on both Manhattan and non-Manhattan scenes; the non-Manhattan evaluation uses a new benchmark of 500 diverse environments.

What carries the argument

SPG-Layout, which pairs statistical priors of object distributions for training with a hierarchical strategy that places large objects before smaller ones.
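
To make the large-objects-first idea concrete, here is a minimal geometric sketch. It is not the paper's pipeline: it assumes each object is approximated by a disc footprint (hypothetical `radii`), the room is a simple polygon, and candidate positions come from a coarse grid. All function names are illustrative.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Ray-casting test: True if p lies inside the polygon."""
    x, y = p
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def dist_to_segment(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the wall segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    t = 0.0
    if seg_len_sq > 0:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def fits(center: Point, radius: float, room: List[Point], placed) -> bool:
    """True if the disc stays inside the room and clear of placed discs."""
    if not point_in_polygon(center, room):
        return False
    walls = zip(room, room[1:] + room[:1])
    if any(dist_to_segment(center, a, b) < radius for a, b in walls):
        return False
    return all(math.dist(center, c) >= radius + r for c, r in placed)


def place_large_first(room: List[Point], radii: Dict[str, float],
                      step: float = 0.25) -> Dict[str, Point]:
    """Greedy placement: largest footprint first, first feasible grid cell."""
    xs, ys = [p[0] for p in room], [p[1] for p in room]
    nx = int((max(xs) - min(xs)) / step) + 1
    ny = int((max(ys) - min(ys)) / step) + 1
    candidates = [(min(xs) + i * step, min(ys) + j * step)
                  for j in range(ny) for i in range(nx)]
    placed: Dict[str, Point] = {}
    for name in sorted(radii, key=radii.get, reverse=True):
        occupied = [(placed[n], radii[n]) for n in placed]
        for c in candidates:
            if fits(c, radii[name], room, occupied):
                placed[name] = c
                break  # objects that never fit are simply left out
    return placed


# Hypothetical pentagonal (non-Manhattan) room and object footprints, in metres.
room = [(0.0, 0.0), (5.0, 0.0), (6.0, 3.0), (3.0, 5.0), (0.0, 4.0)]
radii = {"chair": 0.3, "bed": 1.2, "table": 0.6}
layout = place_large_first(room, radii)
```

The only design choice being illustrated is the sort order: small objects are placed into whatever free space the large ones leave, rather than fragmenting that space first.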

If this is right

Generated scenes in non-Manhattan rooms show fewer geometric violations and higher physical fidelity than prior text-driven approaches.
The same components also raise performance on standard Manhattan environments.
Text prompts can now drive synthesis across a broader set of real-world indoor geometries.
Starting with large objects reduces conflicts when smaller objects are added later.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support more flexible scene creation for games or virtual tours where walls are not at right angles.
The 500-scene benchmark, if released alongside the code, would offer a concrete testbed for other researchers working on irregular layouts.
Combining these priors with newer 3D diffusion models might extend the approach to multi-room or changing scenes.
Testing the priors on outdoor or hybrid spaces could reveal how far the distribution statistics transfer.

Load-bearing premise

Statistical priors on object distributions plus a large-object-first ordering will be enough to handle non-orthogonal relationships and avoid creating new placement problems.

What would settle it

If side-by-side tests on the 500 non-Manhattan environments show SPG-Layout producing equal or higher rates of layout violations than current methods, the performance claim would not hold.
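
One way such a side-by-side test could be scored is a paired bootstrap over per-scene violation counts. The sketch below is an assumption about how a reader might run the check, not the paper's evaluation protocol; `paired_bootstrap_ci` and the toy counts are hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap_ci(ours, baseline, n_boot=2000, alpha=0.05, seed=0):
    """Mean per-scene difference (ours - baseline) with a bootstrap interval.

    Both lists must hold one violation count per scene, in the same order.
    """
    if len(ours) != len(baseline):
        raise ValueError("need one paired measurement per scene")
    diffs = [o - b for o, b in zip(ours, baseline)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lo, hi


# Toy, made-up counts for eight scenes; real use would pass all 500.
ours_counts = [0, 1, 0, 2, 0, 1, 0, 0]
baseline_counts = [2, 3, 1, 2, 4, 1, 2, 3]
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(ours_counts, baseline_counts)
```

Under this reading, the performance claim survives only if the upper end of the interval is below zero; a mean difference at or above zero is the outcome described above as settling the question against the method.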

Figures

Figures reproduced from arXiv: 2607.02407 by Hangjun Ye, Jun Liu, Kai Chen, Kun Wang, Li Zhang, Long Chen, Xianhui Meng, Xiaoshuai Hao, Xiuying Chen, Yan Luo, Yongxuan Lv, Yuchen Zhang, Zirui Song.

**Figure 1.** Manhattan vs. Non-Manhattan Environments. Manhattan environments (left) are simple, characterized by orthogonal structures. In contrast, non-Manhattan environments (right) are difficult to understand due to their arbitrary, irregular, and curved geometries, requiring precise reasoning about mesh details and rotation angles. Adjacent text: … in fact, prevalent. Contemporary architecture frequently employs complex geometri…

**Figure 2.** Overview of the proposed SPG-Layout framework. (i) Provides an intuitive comprehension of structured scene representation (Sec. 3.1). (ii) Structural alignment by full parameter fine-tuning (Sec. 4.3 stage 1). (iii) Inference Pipeline illustrates the hierarchical layout strategy for progressive generation (Sec. 4.4). (iv) Reinforcement Learning optimizes the model using a multi-objective reward function vi…

**Figure 3.** Mechanism of Spatial Prior Guidance (SPG). SPG explicitly integrates geometric boundary priors and semantic co-occurrence affinities to provide computable reward signals for reinforcement learning-based scene generation. Adjacent text (Sec. 4.1, Non-Manhattan Data Collection): To facilitate our research on non-Manhattan environments, we present a semi-automated data generation pipeline, illustrated in…

**Figure 4.** The Illustration of HLS. Adjacent text (Sec. 4.4, Hierarchical Layout Strategy): While SPG-Layout excels in incremental object placement within existing scenes, text-driven 3D scene synthesis necessitates the ability to generate holistic scenes from scratch. In the absence of established spatial references, the unprioritized placement of smaller objects risks fragmenting the continuous free space. This fragmentation creates se…

**Figure 5.** Qualitative Results on non-Manhattan Dataset. Given real-world room type, SPG-Layout generates complete scene layouts guided by specific user instructions. Adjacent text (Sec. 5.3, Generalization Capacity and User Study): Generalization Capacity. To evaluate generalization, we test in standard Manhattan environments (Tab. 2). Results show that our method outperforms existing approaches in both non-Manhattan and conventional Manh…

**Figure 6.** Qualitative results of layout generation in complex non-Manhattan environments compared to the existing methods.

**Figure 7.** Semi-Automated Data Generation Pipeline. We extract geometric boundaries (a) and generate coarse scenes (b), followed by manual refinement of object attributes (c) to construct a curated dataset of 500 non-Manhattan scenes (d). Adjacent text: … this metric accounts for the detailed geometry of the assets: $L_{\mathrm{MBL}} = \sum_{1 \le i < j \le N} \mathrm{Vol}\big(V(o_i) \cap V(o_j)\big)$ (Eq. 11), where $\cap$ denotes the intersection of the voxelized meshes. • Voxelization-Ba…

**Figure 8.** Expansion Capability of SPG-Layout. Adjacent text (Sec. 11, Expansion Capability of Our Framework): Leveraging Structured Scene Representation (SSR), our framework goes beyond scene synthesis and naturally supports object-level scene editing, including removal and replacement of existing entities. Under the explicit, object-centric SSR formulation, object removal becomes a straightforward and interpretable procedure. Specific…

**Figure 9.** Failure Cases. Although no collisions occur, the generated layout severely deviates from human aesthetics, with cluttered and disorganized furniture arrangements. Adjacent text (Sec. 13.4, Additional Ablation Study): To explore the impact of different aggregation operators in the SPG module, we conduct ablation experiments on the full scene synthesis task. The quantitative results are summarized in…

**Figure 10.** More Visualizations.

**Figure 11.** Stacked Cases.
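
The pairwise voxel-intersection sum quoted in the Figure 7 excerpt (Eq. 11) can be sketched in a few lines. This assumes each voxelized object V(o_i) is stored as a set of integer grid indices on a shared grid; `box_voxels` is a hypothetical helper that stands in for a real mesh voxelizer, and the paper's own implementation may differ.

```python
from itertools import combinations


def box_voxels(lo, hi):
    """Hypothetical stand-in for a voxelizer: cells of an axis-aligned box.

    lo is inclusive and hi is exclusive, both as integer (x, y, z) indices.
    """
    return {(x, y, z)
            for x in range(lo[0], hi[0])
            for y in range(lo[1], hi[1])
            for z in range(lo[2], hi[2])}


def pairwise_overlap_volume(voxel_sets, voxel_size=1.0):
    """Sum over all pairs i < j of the volume of V(o_i) intersect V(o_j)."""
    cell_volume = voxel_size ** 3
    shared_cells = sum(len(a & b) for a, b in combinations(voxel_sets, 2))
    return shared_cells * cell_volume


# Three toy objects on a 0.5 m grid: the first two share two cells,
# the third is disjoint from both.
objects = [
    box_voxels((0, 0, 0), (2, 2, 2)),
    box_voxels((1, 1, 0), (3, 3, 2)),
    box_voxels((5, 5, 5), (6, 6, 6)),
]
overlap = pairwise_overlap_volume(objects, voxel_size=0.5)  # 2 cells * 0.125 = 0.25
```

A value of zero means no two voxelized objects share a cell at the chosen resolution; finer grids catch smaller interpenetrations at higher cost.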

Original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SPG-Layout adds hierarchical large-object placement and a 500-scene non-Manhattan benchmark, but the abstract supplies no metrics and the priors look likely to carry Manhattan bias.

The core contribution is a practical extension: statistical priors on object distributions plus a large-first hierarchical layout step to reduce violations in angled rooms. The new benchmark of 500 non-Manhattan scenes is a clear step forward for evaluation.

The hierarchical strategy makes sense on its face and aligns with how designers actually work. Building a dedicated non-Manhattan test set also fills an obvious gap that Manhattan-only methods leave open.

The main weakness is the lack of any numbers. The abstract asserts outperformance and better physical plausibility but shows no error rates, ablations, or comparisons on the new benchmark. Without those, it is impossible to judge whether the gains are real or artifacts of benchmark construction. The stress-test point on priors also lands: standard indoor datasets are overwhelmingly Manhattan, and nothing in the description indicates the priors were recomputed or conditioned on angled examples. If the training signal still encodes orthogonal preferences, the hierarchical rule may only paper over the mismatch rather than solve it.

This work is aimed at researchers doing text-to-3D scene generation for VR, robotics, or interior design who already know the Manhattan limitation. It deserves a serious referee because the problem is concrete and the proposed fix is simple enough to test directly. Ask the authors for the quantitative results, the source of the priors, and an ablation on the hierarchical step before deciding on acceptance.

Referee Report

2 major / 1 minor

Summary. The paper proposes SPG-Layout, a text-driven framework for 3D indoor scene synthesis in non-Manhattan environments. It combines statistical priors on object distributions to guide training with a hierarchical layout strategy that places large objects first (mirroring human workflows) to reduce geometric violations. A new benchmark of 500 diverse non-Manhattan scenes is introduced, and the method is claimed to achieve better semantic realism and physical plausibility than prior work on both Manhattan and non-Manhattan cases, with code to be released.

Significance. If the quantitative claims hold after proper validation, the work would address a clear gap in handling non-orthogonal layouts, where existing LLM-based methods produce high violation rates. The hierarchical strategy and new benchmark are potentially useful contributions; public code release would aid reproducibility.

major comments (2)

[Abstract] Abstract: the central claim of 'consistent and significant outperformance' and 'balanced optimization of semantic realism and physical plausibility' is unsupported by any quantitative results, ablation studies, error metrics, or tables in the provided text, rendering the claim unevaluable.
[Methods (statistical priors)] The statistical priors section (wherever defined): the manuscript does not state that the priors were recomputed or conditioned on non-Manhattan examples. If estimated from standard indoor datasets (typically Manhattan-dominant), the training signal encodes orthogonal biases, so the reported gains on the 500-scene benchmark may be an artifact rather than evidence that the priors plus hierarchical placement suffice for angled geometries.
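
A simple diagnostic for the bias concern in the second major comment is to measure how many object orientations in the prior's source data sit near a multiple of 90 degrees. The sketch below is a hypothetical check, not something the manuscript describes; it assumes yaw angles are given in degrees relative to a fixed room frame, and `axis_aligned_fraction` and the 5-degree tolerance are illustrative choices.

```python
def axis_aligned_fraction(yaws_deg, tol_deg=5.0):
    """Fraction of yaw angles within tol_deg of a multiple of 90 degrees."""
    if not yaws_deg:
        raise ValueError("need at least one yaw angle")

    def off_axis(angle):
        # Distance in degrees to the nearest multiple of 90.
        r = angle % 90.0
        return min(r, 90.0 - r)

    aligned = sum(off_axis(a) <= tol_deg for a in yaws_deg)
    return aligned / len(yaws_deg)


# Made-up yaw samples: four axis-aligned objects and two rotated ones.
sample_yaws = [0.0, 90.0, 180.0, 270.0, 45.0, 30.0]
fraction = axis_aligned_fraction(sample_yaws)
```

A fraction close to 1.0 on the data used to estimate the priors would support the referee's worry that the training signal encodes orthogonal preferences; the same statistic on the 500-scene benchmark would show how far the two distributions differ.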

minor comments (1)

[Abstract] The abstract mentions 'extensive experiments' but supplies no details on metrics, baselines, or failure modes; this should be expanded in the results section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and address each major comment below. We focus on clarifying the manuscript content and committing to revisions where needed to strengthen the presentation.

Referee: [Abstract] Abstract: the central claim of 'consistent and significant outperformance' and 'balanced optimization of semantic realism and physical plausibility' is unsupported by any quantitative results, ablation studies, error metrics, or tables in the provided text, rendering the claim unevaluable.

Authors: The full manuscript contains Section 5 (Experiments) with quantitative results on the 500-scene non-Manhattan benchmark, including tables reporting metrics for semantic realism (e.g., category and attribute accuracy) and physical plausibility (e.g., collision and support violation rates), plus ablation studies comparing components. These directly support the abstract claims. If the reviewed version omitted or truncated this section, we will ensure completeness. We will revise the abstract to briefly reference key quantitative gains (e.g., relative reductions in violations) for immediate evaluability. revision: yes
Referee: [Methods (statistical priors)] The statistical priors section (wherever defined): the manuscript does not state that the priors were recomputed or conditioned on non-Manhattan examples. If estimated from standard indoor datasets (typically Manhattan-dominant), the training signal encodes orthogonal biases, so the reported gains on the 500-scene benchmark may be an artifact rather than evidence that the priors plus hierarchical placement suffice for angled geometries.

Authors: The statistical priors are derived from the 3D-FRONT dataset (primarily Manhattan layouts), as is standard in the field. The manuscript will be revised to explicitly state this source in the methods section. We will also add an ablation isolating the hierarchical placement strategy's contribution on the non-Manhattan benchmark to show that gains arise from the combination rather than priors alone. This addresses the potential bias concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external priors and empirical benchmark

The provided abstract and description contain no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs. Statistical priors and the hierarchical placement strategy are presented as design choices guiding training and layout, with performance evaluated on a newly constructed 500-scene benchmark. No self-definitional steps, fitted-input predictions, or load-bearing self-citations are visible; the method is self-contained against external benchmarks and does not reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all entries left empty.

Reference graph

Works this paper leans on

106 extracted references · 30 canonical work pages · 12 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

From programs to poses: Factored real-world scene generation via learned program libraries , author=. Advances in Neural Information Processing Systems , volume=
[2]

arXiv preprint arXiv:2503.16848 , year=

Hsm: Hierarchical scene motifs for multi-scale indoor scene generation , author=. arXiv preprint arXiv:2503.16848 , year=

work page arXiv
[3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Infinigen indoors: Photorealistic indoor scenes using procedural generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[4]

2021 International Conference on 3D Vision (3DV) , pages=

Sceneformer: Indoor scene generation with transformers , author=. 2021 International Conference on 3D Vision (3DV) , pages=. 2021 , organization=

2021
[5]

Advances in Neural Information Processing Systems , volume=

Atiss: Autoregressive transformers for indoor scene synthesis , author=. Advances in Neural Information Processing Systems , volume=
[6]

arXiv preprint arXiv:2405.21066 , year=

Mixed Diffusion for 3D Indoor Scene Synthesis , author=. arXiv preprint arXiv:2405.21066 , year=

work page arXiv
[7]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

3d-front: 3d furnished rooms with layouts and semantics , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[8]

International Journal of Computer Vision , volume=

3d-future: 3d furniture shape with texture , author=. International Journal of Computer Vision , volume=. 2021 , publisher=

2021
[9]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Advances in Neural Information Processing Systems , volume=

Panoptic 3d scene reconstruction from a single rgb image , author=. Advances in Neural Information Processing Systems , volume=
[11]

arXiv preprint arXiv:2402.04717 , year=

Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior , author=. arXiv preprint arXiv:2402.04717 , year=

work page arXiv
[12]

CubiCasa5K: A Dataset and an Improved Multi-Task Model for Floorplan Image Analysis , booktitle =

Ahti Kalervo and Juha Ylioinas and Markus H. CubiCasa5K: A Dataset and an Improved Multi-Task Model for Floorplan Image Analysis , booktitle =
[13]

HorizonNet: Learning Room Layout with 1D Representation and Pano Stretch Data Augmentation , booktitle =

Cheng Sun and Chi. HorizonNet: Learning Room Layout with 1D Representation and Pano Stretch Data Augmentation , booktitle =. 2019 , pages =

2019
[14]

Wang and Angel X

Hou In Ivan Tam and Hou In Derek Pun and Austin T. Wang and Angel X. Chang and Manolis Savva , title =. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =
[15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Chuhang Zou and Alex Colburn and Qi Shan and Derek Hoiem , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
[16]

Advances in neural information processing systems , volume=

Gans trained by a two time-scale update rule converge to a local nash equilibrium , author=. Advances in neural information processing systems , volume=
[17]

The Role of ImageNet Classes in Fr

Kynk. The Role of ImageNet Classes in Fr
[18]

Demystifying MMD GANs

Demystifying mmd gans , author=. arXiv preprint arXiv:1801.01401 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Human-centric indoor scene synthesis using stochastic grammar , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=
[21]

arXiv preprint arXiv:2303.03565 , year=

Clip-layout: Style-consistent indoor scene synthesis with semantic furniture embedding , author=. arXiv preprint arXiv:2303.03565 , year=

work page arXiv
[22]

Computer Vision -- ECCV 2024 Workshops , series =

Ata Çelen and Guo Han and Konrad Schindler and Luc Van Gool and Iro Armeni and Anton Obukhov and Xi Wang , title =. Computer Vision -- ECCV 2024 Workshops , series =. 2025 , pages =

2024
[23]

Proceedings of the International Conference on 3D Vision (3DV) , year =

Chuan Fang and Yuan Dong and Kunming Luo and Xiaotao Hu and Rakesh Shrestha and Ping Tan , title =. Proceedings of the International Conference on 3D Vision (3DV) , year =
[24]

IEEE Transactions on Visualization and Computer Graphics , volume=

Fast and scalable position-based layout synthesis , author=. IEEE Transactions on Visualization and Computer Graphics , volume=. 2018 , publisher=

2018
[25]

Human-Centric Indoor Scene Synthesis Using Stochastic Grammar , booktitle =

Siyuan Qi and Yixin Zhu and Siyuan Huang and Chenfanfu Jiang and Song. Human-Centric Indoor Scene Synthesis Using Stochastic Grammar , booktitle =. 2018 , pages =

2018
[26]

Computer Graphics Forum , volume =

Daniel Ritchie and Sarah Jobalia and Anna Thomas and Pat Hanrahan , title =. Computer Graphics Forum , volume =
[27]

Activity-Centric Scene Synthesis for Functional 3D Scene Modeling , journal =

Matthew Fisher and Manolis Savva and Pat Hanrahan and Matthias Nie. Activity-Centric Scene Synthesis for Functional 3D Scene Modeling , journal =
[28]

Computers & Graphics , volume =

Pulak Purkait and Christopher Zach and Tat-Jun Chin and Ian Reid , title =. Computers & Graphics , volume =
[29]

Graphics Interface , volume=

Constraint-based automatic placement for scene composition , author=. Graphics Interface , volume=. 2002 , organization=

2002
[30]

ACM Transactions on Graphics (TOG)-Proceedings of ACM SIGGRAPH 2011, v

Make it home: automatic optimization of furniture arrangement , author=. ACM Transactions on Graphics (TOG)-Proceedings of ACM SIGGRAPH 2011, v. 30,(4), July 2011, article no. 86 , volume=

2011
[31]

European Conference on Computer Vision , pages=

Sg-vae: Scene grammar variational autoencoder to generate new indoor scenes , author=. European Conference on Computer Vision , pages=. 2020 , organization=

2020
[32]

ACM Transactions on Graphics (TOG) , volume=

Activity-centric scene synthesis for functional 3D scene modeling , author=. ACM Transactions on Graphics (TOG) , volume=. 2015 , publisher=

2015
[33]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Fast and flexible indoor scene synthesis via deep convolutional generative models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[34]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Diffuscene: Denoising diffusion models for generative indoor scene synthesis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[35]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Lego-net: Learning regular rearrangements of objects in rooms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[36]

ACM Transactions on Graphics (TOG) , volume=

Deep convolutional priors for indoor scene synthesis , author=. ACM Transactions on Graphics (TOG) , volume=. 2018 , publisher=

2018
[37]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
[38]

Finetuned Language Models Are Zero-Shot Learners

Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=
[40]

Advances in Neural Information Processing Systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in Neural Information Processing Systems , volume=
[41]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large language monkeys: Scaling inference compute with repeated sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

SIGGRAPH Asia 2022 Conference Papers , pages=

Scene synthesis from human motion , author=. SIGGRAPH Asia 2022 Conference Papers , pages=

2022
[45]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

MIME: Human-aware 3D scene generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[46]

Advances in Neural Information Processing Systems , volume=

Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion , author=. Advances in Neural Information Processing Systems , volume=
[47]

European Conference on Computer Vision , pages=

Echoscene: Indoor scene generation via information echo over scene graph diffusion , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[48]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[49]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

End-to-end optimization of scene layout , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[50]

Advances in Neural Information Processing Systems , volume=

Debara: Denoising-based 3d room arrangement generation , author=. Advances in Neural Information Processing Systems , volume=
[51]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[52]

Advances in Neural Information Processing Systems , volume=

Layoutgpt: Compositional visual planning and generation with large language models , author=. Advances in Neural Information Processing Systems , volume=
[53]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Holodeck: Language guided generation of 3d embodied ai environments , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[54]

arXiv preprint arXiv:2412.02193 , year=

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models , author=. arXiv preprint arXiv:2412.02193 , year=

work page arXiv
[55]

arXiv preprint arXiv:2403.09675 , year=

Open-universe indoor scene generation using llm program synthesis and uncurated object databases , author=. arXiv preprint arXiv:2403.09675 , year=

work page arXiv
[56]

arXiv preprint arXiv:2406.03866 , year=

Llplace: The 3d indoor scene layout generation and editing via large language model , author=. arXiv preprint arXiv:2406.03866 , year=

work page arXiv
[57]

arXiv preprint arXiv:2410.16770 , year=

The scene language: Representing scenes with programs, words, and embeddings , author=. arXiv preprint arXiv:2410.16770 , year=

work page arXiv
[58]

arXiv preprint arXiv:1901.02875 , year=

Learning to infer and execute 3d shape programs , author=. arXiv preprint arXiv:1901.02875 , year=

work page arXiv 1901
[59]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=
[60]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[61]

Qwen2.5 Technical Report

Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

European Conference on Computer Vision , pages=

Scenescript: Reconstructing scenes with an autoregressive structured language model , author=. European Conference on Computer Vision , pages=
[64]

Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16 , pages=

Convolutional occupancy networks , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16 , pages=

2020
[65]

Communications of the ACM , volume=

Nerf: Representing scenes as neural radiance fields for view synthesis , author=. Communications of the ACM , volume=
[66]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

T " ulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67]

arXiv preprint arXiv:2503.23829 , year=

Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains , author=. arXiv preprint arXiv:2503.23829 , year=

work page arXiv
[68]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=
[69]

arXiv preprint arXiv:2406.08660 , year=

Fine-Tuned'Small'LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification , author=. arXiv preprint arXiv:2406.08660 , year=

work page arXiv
[70]

Resources, Conservation and Recycling , volume=

Towards a ‘resource cadastre’for a circular economy--urban-scale building material detection using street view imagery and computer vision , author=. Resources, Conservation and Recycling , volume=
[71]

Automation in Construction , volume=

Performance-based generative design for parametric modeling of engineering structures using deep conditional generative models , author=. Automation in Construction , volume=
[72]

PhyScene: Physically interactable 3D scene synthesis for embodied AI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[73]

Text2Room: Extracting textured 3D meshes from 2D text-to-image models. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[74]

DreamScene: 3D Gaussian-based text-to-3D scene generation via formation pattern sampling. European Conference on Computer Vision.
[75]

Set-the-Scene: Global-local training for generating controllable NeRF scenes. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[76]

SceneCraft: Automating interactive narrative scene generation in digital games with large language models. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
[77]

Ctrl-Room: Controllable text-to-3D room meshes generation with layout constraints. 2025 International Conference on 3D Vision (3DV), 2025.
[78]

Forest2Seq: Revitalizing order prior for sequential indoor scene synthesis. European Conference on Computer Vision.
[79]

CasaGPT: Cuboid arrangement and scene assembly for interior design. Proceedings of the Computer Vision and Pattern Recognition Conference.
[80]

SpatialLM: Training Large Language Models for Structured Indoor Modeling. arXiv preprint arXiv:2506.07491.


