VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement

Chenliang Zhou; Fangcheng Zhong; Haotian Mao; Hui Wang; Jiatao Lin; Xubo Yang; Yang Zhao; Yan Zhang; Yiheng Zhang; Yuhan Huang

arxiv: 2605.17102 · v1 · pith:IFFNB36Mnew · submitted 2026-05-16 · 💻 cs.GR · cs.CV

Haotian Mao, Yuhan Huang, Jiatao Lin, Yang Zhao, Hui Wang, Yiheng Zhang, Yuwang Wang, Chenliang Zhou, Yan Zhang, Fangcheng Zhong, Xubo Yang

Pith reviewed 2026-05-20 14:41 UTC · model grok-4.3

classification 💻 cs.GR cs.CV

keywords voxel diffusion · 3D scene synthesis · indoor scene arrangement · collision-free layout · anchor-conditioned generation · volumetric occupancy · object-centric representation

The pith

Anchor-conditioned voxel diffusion generates collision-free 3D indoor scenes from sequential discrete volumetric occupancies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that builds indoor scenes by diffusing voxel occupancies for each object in turn. Each new object is conditioned on already placed anchors and nearby context rather than on global bounding boxes or implicit surfaces. This explicit discrete representation ensures that no two objects claim the same voxel space, removing the overlaps that arise when methods treat objects as loose proxies. A reader would care because the result is physically valid arrangements even when rooms are densely furnished. The synthesized voxels can then be used directly to fetch matching 3D assets.

Core claim

VoxScene is an anchor-conditioned voxel diffusion framework for 3D scene synthesis. The pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. Exploiting the mutually exclusive nature of discrete voxels eliminates spatial ambiguities and guarantees collision-free arrangements even in highly complex environments. The high-fidelity voxel grids additionally serve as geometric queries for downstream asset retrieval.

What carries the argument

Sequential synthesis of discrete volumetric occupancies conditioned on anchors and local context, using the mutual exclusivity of voxels to enforce non-overlap.
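
A minimal sketch of how voxel exclusivity can rule out overlaps, under an assumption the summary does not confirm: all objects write into one shared grid, and a voxel already claimed by an earlier object is skipped. The names `place_object` and `build_scene` are hypothetical, and sparse coordinate sets stand in for dense voxel tensors.

```python
from typing import Dict, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def place_object(scene: Dict[Voxel, int], obj_id: int,
                 proposed: Iterable[Voxel]) -> Set[Voxel]:
    """Write an object's proposed voxels into the shared grid.

    Hypothetical rule: a voxel already owned by an earlier object is
    skipped, so every voxel ends up with at most one owner.
    """
    accepted = set()
    for v in proposed:
        if v not in scene:  # voxel is still free
            scene[v] = obj_id
            accepted.add(v)
    return accepted


def build_scene(proposals: Dict[int, Set[Voxel]]) -> Dict[Voxel, int]:
    """Place objects one at a time, in ascending ID order."""
    scene: Dict[Voxel, int] = {}
    for obj_id in sorted(proposals):
        place_object(scene, obj_id, proposals[obj_id])
    return scene


# Two objects whose proposals both claim voxel (1, 0, 0).
proposals = {
    1: {(0, 0, 0), (1, 0, 0)},
    2: {(1, 0, 0), (2, 0, 0)},
}
scene = build_scene(proposals)
owners = {i: {v for v, o in scene.items() if o == i} for i in proposals}
```

In this sketch the contested voxel stays with the object placed first, so the per-object voxel sets are disjoint by construction. That property holds only inside the shared grid; it says nothing about meshes substituted for the voxels later.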

If this is right

The method produces collision-free layouts by design even in densely populated rooms.
Voxel grids supply direct geometric queries that improve asset retrieval accuracy.
Physical plausibility reaches state-of-the-art levels compared with bounding-proxy or implicit baselines.
Shape diversity increases because the representation is not restricted to coarse bounding volumes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same voxel-exclusivity principle could be applied to outdoor or multi-room layouts if the conditioning window is enlarged.
Replacing the sequential order with a parallel diffusion schedule might reduce generation time while preserving the non-overlap property.
Feeding the final voxel grid into a physics simulator would provide an independent test of stability beyond visual collision checks.

Load-bearing premise

Sequential synthesis of discrete volumetric occupancies conditioned on prior anchors and local context produces globally consistent scenes without post-processing or additional global constraints.

What would settle it

Run the generator on a test set of complex indoor layouts and count the fraction of output scenes that contain any pair of objects whose voxels intersect; a non-zero intersection rate would falsify the collision-free guarantee.
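
The check described above can be sketched as follows, assuming each generated scene is available as a mapping from object name to its set of occupied voxel coordinates. That format, and the names `has_intersection` and `intersection_rate`, are illustrative and not taken from the paper.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]
Scene = Dict[str, Set[Voxel]]  # object name -> occupied voxel coordinates


def has_intersection(scene: Scene) -> bool:
    """True if any two objects in the scene share at least one voxel."""
    return any(a & b for a, b in combinations(scene.values(), 2))


def intersection_rate(scenes: List[Scene]) -> float:
    """Fraction of scenes that contain at least one intersecting pair."""
    if not scenes:
        raise ValueError("need at least one scene")
    return sum(has_intersection(s) for s in scenes) / len(scenes)


# Toy test set: one clean scene and one where desk and chair share a voxel.
clean = {"bed": {(0, 0, 0), (1, 0, 0)}, "lamp": {(3, 0, 0)}}
clash = {"desk": {(0, 0, 0), (0, 1, 0)}, "chair": {(0, 1, 0)}}
rate = intersection_rate([clean, clash])
```

Any rate above zero on a real test set would contradict a strict collision-free guarantee at the voxel level. Checking the retrieved meshes would require voxelizing them first, which this sketch does not do.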

Figures

Figures reproduced from arXiv: 2605.17102 by Chenliang Zhou, Fangcheng Zhong, Haotian Mao, Hui Wang, Jiatao Lin, Xubo Yang, Yang Zhao, Yan Zhang, Yiheng Zhang, Yuhan Huang, Yuwang Wang.

**Figure 1.** VoxScene introduces a voxel-based layout representation. By leveraging the spatial exclusivity of discrete occupancies, our method resolves object intersections (shown in the red box). Furthermore, the explicit 3D structure of voxels excels in handling fine-grained layouts within complex geometric contexts (shown in the blue boxes).

**Figure 2.** Method overview. We formulate scene synthesis as an anchor-conditioned generative process. Guided by prior anchors from various upstream sources (Part 1), our object-centric framework sequentially generates each target object, completing the whole scene represented in voxels through N iterations (Part 2). These generated voxels then serve as explicit geometric proxies for downstream asset retrieval, ultima…

**Figure 3.** Training policy. We employ a stochastic masking policy to enable our model to generate in an arbitrary sequence. The anchor shift policy further improves the robustness and effectiveness.

Excerpt from the adjacent paper text: … a local block $G_l$ of fixed resolution $K^3$, which is centered at $\mathbf{p}_i$ and aligned with the heading orientation $\mathbf{r}_i$. Since global IDs are arbitrarily assigned and completely lack cross-scene consistency, directly feeding the…
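
As a rough illustration of a local block that is centered at an anchor position and aligned with its heading, the sketch below resamples a sparse global grid into a K×K×K block. It assumes the heading is a yaw angle about the vertical (z) axis and uses nearest-voxel lookup; both choices, and the name `sample_local_block`, are assumptions rather than details from the paper.

```python
import math
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def sample_local_block(occupied: Set[Voxel],
                       center: Tuple[float, float, float],
                       yaw: float, k: int) -> Set[Voxel]:
    """Return the occupied cells of a k^3 block centered at `center`.

    Each local cell center is rotated by `yaw` about the z axis,
    translated to the anchor position, and looked up in the global grid
    by rounding to the nearest voxel. Intended for odd k, so that the
    block has a well-defined central cell.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    half = (k - 1) / 2.0
    block = set()
    for i in range(k):
        for j in range(k):
            for h in range(k):
                lx, ly, lz = i - half, j - half, h - half
                gx = center[0] + cos_y * lx - sin_y * ly
                gy = center[1] + sin_y * lx + cos_y * ly
                gz = center[2] + lz
                if (round(gx), round(gy), round(gz)) in occupied:
                    block.add((i, j, h))
    return block


# A two-voxel object lying along the global x axis.
occupied = {(5, 5, 5), (6, 5, 5)}
aligned = sample_local_block(occupied, (5.0, 5.0, 5.0), 0.0, 3)
rotated = sample_local_block(occupied, (5.0, 5.0, 5.0), math.pi / 2, 3)
```

With zero yaw the object extends along the block's first axis; with a quarter turn the same object appears along the block's second axis instead, which is the point of expressing context in the anchor's own frame.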

**Figure 4.** Effect of anchor shifting.

Excerpt from the adjacent paper text: … corrupted by injecting Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ over $T$ discrete timesteps until it reaches a nearly pure isotropic Gaussian distribution:

$$q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\left(\mathbf{z}_t;\ \sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\ \beta_t \mathbf{I}\right), \tag{2}$$

where $\beta_t$ is the variance schedule. Applying the reparameterization trick, the closed-form distribution of the noisy latent at any arbitrary timestep $t$ can be directly sampled without iterativ…
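
The excerpt is cut off before the closed form it mentions. For orientation, the standard DDPM identity that follows from Eq. (2) is $q(\mathbf{z}_t \mid \mathbf{z}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{z}_0,\ (1-\bar{\alpha}_t)\mathbf{I})$ with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. The sketch below samples from it directly; the linear schedule and its endpoints are illustrative values, not settings reported by the paper.

```python
import math
import random
from typing import List


def alpha_bar(betas: List[float], t: int) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def sample_zt(z0: List[float], betas: List[float], t: int,
              rng: random.Random) -> List[float]:
    """Draw z_t ~ q(z_t | z_0) in a single step, without iterating."""
    a_bar = alpha_bar(betas, t)
    scale, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z0]


# Linear variance schedule with T = 1000 steps (illustrative values).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
z_T = sample_zt([1.0, -1.0, 0.5], betas, T, random.Random(0))
```

At `t = 0` the function returns the clean latent unchanged, and at `t = T` the signal coefficient is close to zero, matching the "nearly pure isotropic Gaussian" endpoint described in the excerpt.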

**Figure 5.** Effect of style clustering.

Excerpt from the adjacent paper text: 3.3 Anisotropy Asset Retrieval. Once the iterative diffusion process synthesizes the entire scene layout, we instantiate each generated voxel object with a realistic model. All models are pre-normalized into a canonical unit coordinate system and voxelized at a fixed resolution of $K_A^3$. During inference, the local block is sampled exactly at the anchor’s center without the sp…
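
To make the idea of a voxel grid acting as a retrieval query concrete, here is a small sketch that ranks pre-voxelized assets by intersection over union with the generated occupancy. The similarity measure is an assumption (the excerpt does not state the paper's matching criterion), and `voxel_iou`, `retrieve`, and the toy library are hypothetical.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection over union of two occupancy sets on the same grid."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: Set[Voxel], library: Dict[str, Set[Voxel]]) -> str:
    """Return the name of the library asset with the highest IoU."""
    if not library:
        raise ValueError("asset library is empty")
    return max(library, key=lambda name: voxel_iou(query, library[name]))


# Toy library, assumed to be voxelized in the same canonical frame.
library = {
    "stool": {(0, 0, 0), (0, 0, 1)},
    "bench": {(0, 0, 0), (1, 0, 0), (2, 0, 0)},
}
query = {(0, 0, 0), (1, 0, 0)}
best = retrieve(query, library)
```

The comparison is only meaningful when query and assets share one normalization and resolution, which is why the excerpt's pre-normalization step matters.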

**Figure 6.** Qualitative comparisons in 3D-FRONT.

**Figure 7.** Qualitative comparisons in M3D-Shelf. Our results are generated conditioned on the anchors produced by DiffuScene. We chose the most similar scene from SceneWeaver as a comparison.

**Figure 8.** More results of our voxel generation and retrieved assets.

**Figure 9.** More results on 3D-FRONT.

**Figure 10.** More results on M3D-Shelf.

Original abstract

We present VoxScene, a novel anchor-conditioned voxel diffusion framework tailored for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which overlook volumetric structures. This geometric blindness inevitably leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we shift the paradigm to an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutually exclusive nature of discrete voxels, our approach eliminates spatial ambiguities and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the universality of our method, achieving state-of-the-art physical plausibility and unlocking shape diversity compared to existing layout planners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VoxScene shifts scene generation to anchor-conditioned voxel diffusion to avoid collisions by design, but the local-only conditioning may not fully guarantee global consistency.

The main takeaway is that VoxScene introduces anchor-conditioned voxel diffusion to create indoor scene arrangements, using discrete voxels to supposedly guarantee no collisions even in crowded spaces. The approach moves from proxy representations to explicit volumetric occupancies synthesized sequentially.

The paper does a good job highlighting the problems with current methods that rely on bounding boxes or implicit shapes, which often cause physical issues in dense scenes. By conditioning the diffusion on prior anchors and local context, they build the scene object by object. The voxel grids then help retrieve suitable assets, linking generation to actual geometry. This seems like a solid practical contribution for applications needing realistic layouts. It earns credit for trying to enforce collision-free results by design through the mutually exclusive voxels, rather than hoping post-processing fixes things. If the experiments demonstrate better plausibility and diversity, that would be valuable for the field.

The soft spots center on whether local sequential synthesis truly ensures global consistency. As noted in the stress test, without an explicit global constraint, distant objects could still overlap once their full extents are considered. The abstract's strong wording on guarantees might overstate what local conditioning achieves in complex environments. I'd check the full paper for how they handle the entire scene volume and any ablations on this aspect. The data and claims would need the full experimental section to assess properly, but the framing avoids obvious circularity.

This paper is for graphics researchers interested in generative models for 3D scenes, particularly those concerned with physical constraints. Readers working on layout planning or diffusion-based synthesis would find the voxel shift interesting. It has enough new elements and addresses a real issue to merit serious peer review. I would recommend sending it to referees rather than desk rejecting it.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces VoxScene, a novel anchor-conditioned voxel diffusion framework for 3D indoor scene synthesis. It replaces bounding-box or implicit proxies with an explicit object-centric voxel representation, sequentially synthesizing discrete volumetric occupancies conditioned on prior anchors and local context. The authors claim that the mutual exclusivity of discrete voxels eliminates spatial ambiguities and guarantees collision-free arrangements even in dense scenes; the resulting voxel grids then serve as geometric queries for downstream asset retrieval. Experiments are said to demonstrate state-of-the-art physical plausibility and greater shape diversity relative to existing layout planners.

Significance. If the central claims are substantiated, the work would offer a concrete advance in data-driven scene arrangement by directly addressing geometric collisions that arise from proxy-based methods. The explicit voxel representation and sequential local conditioning constitute a clear methodological shift that could improve both physical realism and downstream retrieval tasks. The absence of machine-checked proofs or fully reproducible code is noted, but the pipeline itself is a falsifiable contribution that invites direct comparison on collision metrics.

major comments (1)

[Abstract and §3] Abstract and §3 (method overview): the central claim that sequential synthesis 'guarantees collision-free arrangements' by exploiting voxel mutual exclusivity rests on the unverified assumption that local conditioning on prior anchors and neighborhood context is sufficient to enforce global occupancy consistency. No global occupancy mask, post-processing step, or formal argument is supplied showing that non-overlapping local neighborhoods cannot produce intersecting extents after asset retrieval; this directly undermines the 'guarantees' language and the SOTA physical-plausibility assertion.

minor comments (2)

[Abstract] Abstract: quantitative claims of 'state-of-the-art physical plausibility' and 'unlocking shape diversity' are stated without any reported metrics, baselines, ablation tables, or error analysis, making the strength of the empirical contribution impossible to evaluate from the summary alone.
[§3] Notation: the conditioning mechanism (anchor features, local context tensor) is described at a high level; a precise definition of the diffusion conditioning input (e.g., concatenation or cross-attention formulation) would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our work. We have reviewed the concern about the strength of our collision-free claim and agree that the language requires clarification to better reflect the method's reliance on sequential local conditioning without a formal global proof.

Referee: [Abstract and §3] Abstract and §3 (method overview): the central claim that sequential synthesis 'guarantees collision-free arrangements' by exploiting voxel mutual exclusivity rests on the unverified assumption that local conditioning on prior anchors and neighborhood context is sufficient to enforce global occupancy consistency. No global occupancy mask, post-processing step, or formal argument is supplied showing that non-overlapping local neighborhoods cannot produce intersecting extents after asset retrieval; this directly undermines the 'guarantees' language and the SOTA physical-plausibility assertion.

Authors: We acknowledge that the manuscript employs the term 'guarantees' without supplying a formal argument, global occupancy mask, or post-processing step to prove global consistency from local neighborhoods. The sequential process conditions each new object's discrete voxel occupancy on prior anchors and local context, and the mutual exclusivity of voxels within each generated grid prevents intra-object overlaps; however, this does not constitute a rigorous global enforcement mechanism, and intersecting extents could theoretically arise if local predictions are inconsistent across distant but overlapping regions after retrieval. Our experiments report improved physical plausibility metrics relative to proxy-based baselines, but these are empirical rather than provable. To address the comment, we will revise the abstract and §3 to replace 'guarantees collision-free arrangements' with 'empirically promotes collision-free arrangements through sequential local conditioning' and add a brief limitations paragraph noting the absence of global consistency proofs.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on explicit voxel representation and standard diffusion conditioning

The paper presents VoxScene as a new pipeline that sequentially synthesizes discrete volumetric occupancies using anchor-conditioned voxel diffusion. The central claim that mutual exclusivity of discrete voxels eliminates spatial ambiguities and guarantees collision-free results follows directly from the choice of explicit object-centric voxel grids and local-context conditioning, without any equations or steps that reduce the output to fitted parameters, self-citations, or definitional loops. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results are invoked. The method is framed as building on standard diffusion assumptions with an independent geometric representation, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the method rests on standard diffusion model assumptions and the domain assumption that voxels provide mutually exclusive occupancy.

free parameters (1)

voxel grid resolution
Choice of voxel size is a design parameter that trades off detail against computation and is not derived from first principles.
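
A short worked example of that trade-off, using an illustrative room size and voxel sizes that are not taken from the paper: halving the voxel edge multiplies the number of cells by eight. Lengths are kept in whole centimetres to avoid floating-point rounding in the division; `voxel_count` is a hypothetical helper.

```python
from typing import Tuple


def voxel_count(room_cm: Tuple[int, int, int], voxel_cm: int) -> int:
    """Number of cubic cells needed to cover a box-shaped room."""
    count = 1
    for side in room_cm:
        count *= -(-side // voxel_cm)  # ceiling division per axis
    return count


room = (600, 400, 300)  # width, depth, height in centimetres (illustrative)
coarse = voxel_count(room, 10)  # 10 cm voxels
fine = voxel_count(room, 5)     # 5 cm voxels
```

For this room the 10 cm grid needs 72,000 cells and the 5 cm grid needs 576,000, so memory and compute for a dense grid grow cubically as the resolution is refined.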

axioms (1)

domain assumption: Discrete voxels are mutually exclusive in occupancy
Invoked to guarantee collision-free results; stated in the abstract as the key property exploited by the method.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 4 internal anchors

[1]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

FreeScene: Mixed graph diffusion for 3D scene synthesis from free prompts , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[2]

European Conference on Computer Vision , pages=

I-design: Personalized llm interior designer , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[3]

arXiv preprint arXiv:2601.05810 , year=

SceneFoundry: Generating Interactive Infinite 3D Worlds , author=. arXiv preprint arXiv:2601.05810 , year=

work page arXiv
[4]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

A unified, resilient, and explainable adversarial patch detector , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[5]

Advances in Neural Information Processing Systems , volume=

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[6]

Advances in Neural Information Processing Systems , volume=

Debara: Denoising-based 3d room arrangement generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

Spatialgen: Layout-guided 3d indoor scene generation,

Spatialgen: Layout-guided 3d indoor scene generation , author=. arXiv preprint arXiv:2509.14981 , volume=

work page arXiv
[8]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

CasaGPT: cuboid arrangement and scene assembly for interior design , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[9]

ACM Transactions on Graphics (TOG) , volume=

Example-based synthesis of 3D object arrangements , author=. ACM Transactions on Graphics (TOG) , volume=. 2012 , publisher=

work page 2012
[10]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Graphdreamer: Compositional 3d scene synthesis from scene graphs , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[11]

Advances in neural information processing systems , volume=

Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning , author=. Advances in neural information processing systems , volume=

work page
[12]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Mixed diffusion for 3d indoor scene synthesis , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[13]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Fireplace: Geometric refinements of llm common sense reasoning for 3d object placement , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[14]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Layoutdm: Discrete diffusion model for controllable layout generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[15]

Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

Behavior-Aware Anthropometric Scene Generation for Human-Usable 3D Layouts , author=. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

work page 2026
[16]

and Mu, Y

Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior , author=. arXiv preprint arXiv:2402.04717 , year=

work page arXiv
[17]

arXiv preprint arXiv:2603.27573 , year=

SPREAD: Spatial-Physical REasoning via geometry Aware Diffusion , author=. arXiv preprint arXiv:2603.27573 , year=

work page arXiv
[18]

arXiv preprint arXiv:2303.03565 , year=

Clip-layout: Style-consistent indoor scene synthesis with semantic furniture embedding , author=. arXiv preprint arXiv:2303.03565 , year=

work page arXiv
[19]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[20]

Advances in neural information processing systems , volume=

Atiss: Autoregressive transformers for indoor scene synthesis , author=. Advances in neural information processing systems , volume=

work page
[21]

arXiv preprint arXiv:2602.09153 , year=

Scenesmith: Agentic generation of simulation-ready indoor scenes , author=. arXiv preprint arXiv:2602.09153 , year=

work page arXiv
[22]

arXiv preprint arXiv:2503.16848 , year=

Hsm: Hierarchical scene motifs for multi-scale indoor scene generation , author=. arXiv preprint arXiv:2503.16848 , year=

work page arXiv
[23]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Infinigen indoors: Photorealistic indoor scenes using procedural generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[24]

Advances in Neural Information Processing Systems , volume=

Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page
[25]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Efficient 3d semantic segmentation with superpoint transformer , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[26]

IEEE/ACM Transactions on networking , volume=

Chord: a scalable peer-to-peer lookup protocol for internet applications , author=. IEEE/ACM Transactions on networking , volume=. 2003 , publisher=

work page 2003
[27]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Layoutvlm: Differentiable optimization of 3d layout via vision-language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[28]

arXiv preprint arXiv:2508.18597 , year=

SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis , author=. arXiv preprint arXiv:2508.18597 , year=

work page arXiv
[29]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Diffuscene: Denoising diffusion models for generative indoor scene synthesis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[30]

ACM Transactions on Graphics (TOG) , volume=

Deep convolutional priors for indoor scene synthesis , author=. ACM Transactions on Graphics (TOG) , volume=. 2018 , publisher=

work page 2018
[31]

2021 International conference on 3D vision (3DV) , pages=

Sceneformer: Indoor scene generation with transformers , author=. 2021 International conference on 3D vision (3DV) , pages=. 2021 , organization=

work page 2021
[32]

IEEE transactions on pattern analysis and machine intelligence , year=

Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization , author=. IEEE transactions on pattern analysis and machine intelligence , year=

work page
[33]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Lego-net: Learning regular rearrangements of objects in rooms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[34]

arXiv preprint arXiv:2602.10116 , year=

Sage: Scalable agentic 3d scene generation for embodied ai , author=. arXiv preprint arXiv:2602.10116 , year=

work page arXiv
[35]

arXiv e-prints , pages=

Llm-driven indoor scene layout generation via scaled human-aligned data synthesis and multi-stage preference optimization , author=. arXiv e-prints , pages=

work page
[36]

arXiv preprint arXiv:2603.19598 , year=

FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow , author=. arXiv preprint arXiv:2603.19598 , year=

work page arXiv
[37]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Holodeck: Language guided generation of 3d embodied ai environments , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[38]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Physcene: Physically interactable 3d scene synthesis for embodied ai , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[39]

Advances in neural information processing systems , volume=

Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent , author=. Advances in neural information processing systems , volume=

work page
[40]

, author=

Make it home: Automatic optimization of furniture arrangement. , author=. ACM Trans. Graph. , volume=

work page
[41]

Advances in Neural Information Processing Systems , volume=

Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion , author=. Advances in Neural Information Processing Systems , volume=

work page
[42]

European Conference on Computer Vision , pages=

Echoscene: Indoor scene generation via information echo over scene graph diffusion , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[43]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Scenex: Procedural controllable large-scale scene generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[44]

arXiv preprint arXiv:2402.07207 , year=

Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting , author=. arXiv preprint arXiv:2402.07207 , year=

work page arXiv
[45]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[46]

arXiv preprint arXiv:2510.26140 , year=

FullPart: Generating each 3D Part at Full Resolution , author=. arXiv preprint arXiv:2510.26140 , year=

work page arXiv
[47]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Worldgrow: Generating infinite 3d world , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[48]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Lt3sd: Latent trees for 3d scene diffusion , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[49]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[50]

ACM Transactions on Graphics (ToG) , volume=

Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation , author=. ACM Transactions on Graphics (ToG) , volume=. 2024 , publisher=

work page 2024
[51]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

3d-front: 3d furnished rooms with layouts and semantics , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[52]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Matterport3d: Learning from rgb-d data in indoor environments , author=. arXiv preprint arXiv:1709.06158 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

European Conference on Computer Vision , pages=

Structured3d: A large photo-realistic dataset for structured 3d modeling , author=. European Conference on Computer Vision , pages=. 2020 , organization=

work page 2020
[54]

arXiv preprint arXiv:2509.23728 , year=

M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation , author=. arXiv preprint arXiv:2509.23728 , year=

work page arXiv
[55]

2025 International Conference on 3D Vision (3DV) , pages=

3d-gpt: Procedural 3d modeling with large language models , author=. 2025 International Conference on 3D Vision (3DV) , pages=. 2025 , organization=

work page 2025
[56]

Forty-first International Conference on Machine Learning , year=

Scenecraft: An llm agent for synthesizing 3d scenes as blender code , author=. Forty-first International Conference on Machine Learning , year=

work page
[57]

Advances in neural information processing systems , volume=

Openshape: Scaling up 3d shape representation towards open-world understanding , author=. Advances in neural information processing systems , volume=

work page
[58]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Structured 3d latents for scalable and versatile 3d generation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[59]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Foldingnet: Point cloud auto-encoder via deep grid deformation , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[60]

2023 , eprint=

Vox-E: Text-guided Voxel Editing of 3D Objects , author=. 2023 , eprint=

work page 2023
[61]

Computational Visual Media , year=

Computer-aided layout generation for building design: A review , author=. Computational Visual Media , year=

work page
[62]

Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment , volume=

Procedural content generation in games: A survey with insights on emerging llm integration , author=. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment , volume=

work page
[63]

IEEE Network , year=

Large model empowered metaverse: State-of-the-art, challenges and opportunities , author=. IEEE Network , year=

work page
[64]

IEEE Transactions on Emerging Topics in Computational Intelligence , volume=

A survey of embodied ai: From simulators to research tasks , author=. IEEE Transactions on Emerging Topics in Computational Intelligence , volume=. 2022 , publisher=

work page 2022
[65]

IEEE/ASME Transactions on Mechatronics , year=

Aligning cyber space with physical world: A comprehensive survey on embodied ai , author=. IEEE/ASME Transactions on Mechatronics , year=

work page
[66] InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).
[67] House-GAN: Relational generative adversarial networks for graph-constrained house layout generation. European Conference on Computer Vision, 2020.
[68] House-GAN++: Generative adversarial layout refinement network towards intelligent computational agent for professional architects. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[69] Graph transformer GANs for graph-constrained house generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[71] LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems.
[72] Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018.
[73] Attention is all you need. Advances in Neural Information Processing Systems.
[74] Taming transformers for high-resolution image synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[75] Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
[76] High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[77] GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems.
[78] Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[79] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
