
arxiv: 2606.29395 · v1 · pith:BO5ZAIE7new · submitted 2026-06-28 · 💻 cs.CV

NaLA: A 3D Native LLM Layout Agent for High-quality 3D Scene Generation

Pith reviewed 2026-06-30 08:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene generation · LLM layout agent · native 3D encoding · coarse-to-fine prediction · spatial reasoning · layout coherence · pose prediction

The pith

Encoding 3D assets directly into LLMs reduces information loss and improves scene layout quality over text conversion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

NaLA addresses implausible layouts in LLM-based 3D scene generation by avoiding the conversion of 3D assets and boundaries into text descriptions. It feeds 3D geometry straight into the model so the LLM can reason explicitly about collisions, surface support, and containment. A coarse-to-fine mechanism first selects discrete poses autoregressively and then refines them through continuous regression. This produces more coherent placements while cutting inference time compared with prior agents. Ablation studies confirm that each design choice contributes to the gains in geometric perception and layout quality.

Core claim

NaLA encodes 3D scene boundaries and 3D assets directly into the LLM, preserving fine-grained geometry and enabling explicit reasoning over relationships like collisions, surface supporting, and containment. It adopts a coarse-to-fine prediction mechanism that first predicts discrete poses in an autoregressive manner and then refines the discrete poses with a continuous regression. Trained on diverse layout datasets, NaLA attains strong geometric perception and layout coherence and outperforms prior layout agents in both generation quality and inference efficiency.

What carries the argument

Direct native 3D encoding of assets and boundaries into the LLM together with a coarse-to-fine autoregressive-then-regression pose predictor.
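
The coarse-to-fine idea can be illustrated on a single pose coordinate: a discrete bin gives the coarse value and a continuous residual recovers the exact one. This is a minimal sketch, not the paper's tokenizer; the function names, the bin count, and the coordinate range are hypothetical, and in the model the residual would be predicted by a regression step rather than computed from the true value.

```python
import math

def encode_pose_scalar(value, lo, hi, num_bins):
    """Return (bin_index, residual) for one pose coordinate (hypothetical helper)."""
    width = (hi - lo) / num_bins
    # Coarse step: pick the bin, clamped so value == hi lands in the last bin.
    index = min(max(math.floor((value - lo) / width), 0), num_bins - 1)
    center = lo + (index + 0.5) * width
    # Fine step: continuous offset from the bin center.
    return index, value - center

def decode_pose_scalar(index, residual, lo, hi, num_bins):
    """Invert encode_pose_scalar: bin center plus continuous residual."""
    width = (hi - lo) / num_bins
    return lo + (index + 0.5) * width + residual

# Assumed example: an x coordinate of 3.37 m in an 8 m wide room, 16 bins.
index, residual = encode_pose_scalar(3.37, lo=0.0, hi=8.0, num_bins=16)
recovered = decode_pose_scalar(index, residual, lo=0.0, hi=8.0, num_bins=16)
```

With bin-center decoding alone, the error for an in-range value is at most half a bin width; the residual is what removes it.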

If this is right

  • Higher geometric perception from avoiding text-based information loss.
  • Explicit handling of spatial constraints such as collisions and containment.
  • Faster inference than agents that rely on textual descriptions.
  • Improved layout coherence when trained across multiple layout datasets.
  • Each added component contributes measurably to overall performance as shown by ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-encoding approach could reduce errors in other LLM tasks that involve 3D spatial planning.
  • Integration with 3D vision encoders might allow end-to-end generation from images without intermediate text.
  • Scaling the method to larger scenes could test whether native 3D input continues to prevent quality drop-off.

Load-bearing premise

Directly encoding 3D assets and boundaries into the LLM preserves fine-grained geometry and enables explicit reasoning over spatial relationships without introducing new modality-specific errors or training instabilities.

What would settle it

Running NaLA and a text-conversion baseline on identical scene inputs and checking whether NaLA produces fewer object collisions or unsupported placements while using less inference time.
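
One way to make the collision part of this comparison concrete is to count intersecting object pairs in each generated layout. The sketch below assumes axis-aligned bounding boxes given as `(min_xyz, max_xyz)` tuples; that representation, the tolerance, and the function names are illustrative assumptions, and boxes are only a coarse proxy for the fine-grained geometry the paper emphasizes.

```python
from itertools import combinations

def boxes_overlap(a, b, tol=1e-6):
    """True if two axis-aligned boxes overlap by more than tol on every axis."""
    (amin, amax), (bmin, bmax) = a, b
    return all(
        min(amax[i], bmax[i]) - max(amin[i], bmin[i]) > tol
        for i in range(3)
    )

def count_collisions(boxes):
    """Number of unordered box pairs that intersect."""
    return sum(boxes_overlap(a, b) for a, b in combinations(boxes, 2))

# Assumed toy layout: the first two boxes intersect, the third is separate,
# and the fourth only touches the first along a face (not counted).
layout = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((0.5, 0.5, 0.0), (1.5, 1.5, 1.0)),
    ((2.0, 2.0, 0.0), (3.0, 3.0, 1.0)),
    ((-1.0, 0.0, 0.0), (0.0, 1.0, 1.0)),
]
num_collisions = count_collisions(layout)
```

Running the same counter over layouts from both systems on identical inputs would give a directly comparable collision statistic.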

Figures

Figures reproduced from arXiv: 2606.29395 by Cheng Wan, Chucheng Xiang, Runze Wang, Rushi Dai, Wenzheng Wu, Xiang Zhang, Yongsen Mao, Yuan Liu, Yuxuan Xie, Zhongyuan Liu.

Figure 1: Geometry-aware 3D layout generation with NaLA.
Figure 2: Overview of the NaLA pipeline. 3D token encoding:
Figure 3: Placement comparison between NaLA and baseline models. Baseline models cannot perceive fine-grained asset geometry (e.g., round tables, cabinets with shelves), failing to achieve the precise placements produced by NaLA.
… using SPFormer [24]. To bridge the modality gap, we employ two lightweight, trainable Q-Formers (Query Transformers [13]) to encode the dense visual features of the assets and the scene in…
Figure 4: Comparison of different output designs in NaLA.
Figure 5: Coarse-to-fine token design in NaLA. The first four tokens determine the coarse location and orientation of an asset, while the final regression tokens are decoded into fine-grained poses. Different assets are distinguished using different ID tokens. For clarity, we illustrate the mechanism in 2D.
… ID token in the prefix, thereby associating the predicted pose with the correct asset
Figure 6: Qualitative comparison of layout generation. Each row shows top-down views for one scene type (bedroom, conference room, storage); each column shows results from a different method under identical room and asset conditions. NaLA produces physically plausible and semantically coherent arrangements.
… contrast, NaLA operates in a single pass without any post-optimization. Despite this, it maintains a small col…
Figure 7: Irregular-scene evaluation and inference efficiency. (a–b) Top-down views from NaLA and LayoutGPT under the same irregular room; NaLA keeps objects within the boundary while LayoutGPT exhibits out-of-bounds placements. (c) Out-of-bounds (OOB) ratio on irregular scenes. (d) Inference time vs. number of items across methods. NaLA achieves lower OOB and faster inference.
… Point Cloud Input: We compare the (I) …
Figure 8: Qualitative results of NaLA’s layout generation. We present additional placement results for NaLA in
Figure 9: Failure Cases of NaLA. The model finds it difficult to learn placement patterns for rare assets (e.g., the spatial relationships among computers, mice, and keyboards, or the layout of clinical equipment) from limited training examples
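
The out-of-bounds (OOB) ratio mentioned in the Figure 7 caption can be sketched for an irregular room as follows. The exact definition used in the paper is not given here, so this assumes one plausible reading: the fraction of objects with at least one footprint corner outside the floor polygon. The function names and the corner-based footprint format are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def oob_ratio(footprints, polygon):
    """Fraction of footprints with any corner outside the room polygon.

    Corner checks alone can miss an edge that cuts across a concave notch;
    this is a simplification, not a full containment test.
    """
    if not footprints:
        return 0.0
    oob = sum(
        any(not point_in_polygon(x, y, polygon) for x, y in corners)
        for corners in footprints
    )
    return oob / len(footprints)

# Assumed L-shaped room (meters) and two object footprints.
room = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
inside_obj = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
notch_obj = [(2.5, 2.5), (3.5, 2.5), (3.5, 3.5), (2.5, 3.5)]
ratio = oob_ratio([inside_obj, notch_obj], room)
```

The second footprint sits in the missing corner of the L shape, which a rectangular bounding-box check would wrongly accept.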
Original abstract

Recently, Large Language Models (LLMs) have emerged as promising layout agents for 3D scene generation. Existing layout agents still suffer from implausible layout generation because most of them convert 3D assets and 3D layouts into textual descriptions as inputs and outputs, which involves severe information loss due to the modality gap between texts and 3D assets and 3D layouts. We propose NaLA, a native 3D LLM layout Agent for high-quality 3D scene generation by placing 3D assets in the scene. For the inputs, NaLA encodes 3D scene boundaries and 3D assets directly into the LLM, preserving fine-grained geometry and enabling explicit reasoning over relationships like collisions, surface supporting, and containment. To accurately output the positions and orientations of assets, NaLA adopts a coarse-to-fine prediction mechanism that first predicts discrete poses in an autoregressive manner and then refines the discrete poses with a continuous regression. Trained on diverse layout datasets, NaLA attains strong geometric perception and layout coherence. Experiments demonstrate that NaLA outperforms prior layout agents in both generation quality and inference efficiency, with comprehensive ablation studies to verify each component's effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes NaLA, a native 3D LLM layout agent for 3D scene generation. It encodes 3D scene boundaries and assets directly into the LLM (avoiding text-based modality gaps) to enable reasoning over collisions, support, and containment; uses a coarse-to-fine autoregressive discrete pose prediction followed by continuous regression; and claims, based on training on diverse layout datasets plus experiments and ablations, to outperform prior layout agents in generation quality and inference efficiency while attaining strong geometric perception and layout coherence.

Significance. If the claimed outperformance and component effectiveness hold under rigorous evaluation, the work could advance LLM-based 3D scene synthesis by reducing information loss in spatial reasoning, with potential downstream impact on applications requiring coherent 3D layouts. The explicit mention of comprehensive ablation studies to verify each component is a positive aspect of the experimental design.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'NaLA outperforms prior layout agents in both generation quality and inference efficiency' is stated without any quantitative metrics, baselines, dataset sizes, error bars, or experimental protocol details. This absence makes the strength of the result impossible to assess from the provided text and is load-bearing for the paper's primary contribution.
  2. [Abstract] Abstract: The key assumption that directly encoding 3D assets and boundaries 'preserves fine-grained geometry' and enables explicit reasoning 'without introducing new modality-specific errors or training instabilities' is load-bearing for attributing gains to the native approach. No mechanism details (e.g., tokenization or embedding of continuous 3D geometry) or quantitative fidelity checks (e.g., reconstruction error of encoded assets) are supplied, leaving open the possibility that discretization losses are comparable to or larger than those in text-based methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript to improve clarity and support for the central claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'NaLA outperforms prior layout agents in both generation quality and inference efficiency' is stated without any quantitative metrics, baselines, dataset sizes, error bars, or experimental protocol details. This absence makes the strength of the result impossible to assess from the provided text and is load-bearing for the paper's primary contribution.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights. The full manuscript reports these details in the Experiments section (including specific baselines, dataset sizes, and metrics with standard deviations). We will revise the abstract to concisely incorporate summary results (e.g., relative improvements on generation quality metrics and inference speed) while preserving length constraints. revision: yes

  2. Referee: [Abstract] The key assumption that directly encoding 3D assets and boundaries 'preserves fine-grained geometry' and enables explicit reasoning 'without introducing new modality-specific errors or training instabilities' is load-bearing for attributing gains to the native approach. No mechanism details (e.g., tokenization or embedding of continuous 3D geometry) or quantitative fidelity checks (e.g., reconstruction error of encoded assets) are supplied, leaving open the possibility that discretization losses are comparable to or larger than those in text-based methods.

    Authors: Mechanism details for 3D encoding, tokenization, and embedding are provided in Section 3 (Method) of the manuscript. We acknowledge that the abstract does not include quantitative fidelity metrics. We will add a brief reference to these details in the abstract and include a new quantitative analysis (reconstruction error and stability metrics) in the revised Experiments or Ablation section to directly address potential discretization concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external experiments

Full rationale

The provided abstract and description contain no equations, derivations, or self-referential steps. The method is described as encoding 3D boundaries/assets directly into an LLM with a coarse-to-fine pose mechanism, trained on diverse datasets, and evaluated via experiments and ablations. No self-definitional reductions (e.g., a quantity defined in terms of itself), fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled via prior work appear. Central claims of outperformance and geometric preservation are asserted via experimental results rather than reducing to inputs by construction. This is the expected non-finding for an empirical systems paper without a mathematical derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5773 in / 1059 out tokens · 29902 ms · 2026-06-30T08:09:40.199806+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Advances in neural information processing systems 33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  2. [2]

    217–234 (2024)

    Çelen, A., Han, G., Schindler, K., Van Gool, L., Armeni, I., Obukhov, A., Wang, X.: I-design: Personalized llm interior designer, pp. 217–234 (2024)

  3. [3]

    ACM Transactions on Graphics (TOG) 36(4), 1–12 (2017)

    Cordonnier, G., Galin, E., Gain, J., Benes, B., Guérin, E., Peytavie, A., Cani, M.P.: Authoring landscapes by combining ecosystem and terrain erosion simulation. ACM Transactions on Graphics (TOG) 36(4), 1–12 (2017)

  4. [4]

    569–593 (1992)

    Efron, B.: Bootstrap methods: another look at the jackknife, pp. 569–593 (1992)

  5. [5]

    Feng, W., Zhu, W., Fu, T.j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: Layoutgpt: Compositional visual planning and generation with large language models (2023)

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Fu, H., Cai, B., Gao, L., Zhang, L.X., Wang, J., Li, C., Zeng, Q., Sun, C., Jia, R., Zhao, B., et al.: 3d-front: 3d furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10933–10942 (2021)

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Gao, G., Liu, W., Chen, A., Geiger, A., Schölkopf, B.: Graphdreamer: Compositional 3d scene synthesis from scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21295–21304 (2024)

  8. [8]

    ACM Trans

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139–1 (2023)

  9. [9]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Kim, S.W., Brown, B., Yin, K., Kreis, K., Schwarz, K., Li, D., Rombach, R., Torralba, A., Fidler, S.: Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8496–8506 (2023)

  10. [10]

    In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

    Kumaran, V., Rowe, J., Mott, B., Lester, J.: Scenecraft: automating interactive narrative scene generation in digital games with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. vol. 19, pp. 86–96 (2023)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024)

  12. [12]

    Neurocomputing 566, 127052 (2024)

    Li, H., Zhu, G., Zhang, L., Jiang, Y., Dang, Y., Hou, H., Shen, P., Zhao, X., Shah, S.A.A., Bennamoun, M.: Scene graph generation: A comprehensive survey. Neurocomputing 566, 127052 (2024)

  13. [13]

    In: International conference on machine learning

    Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. pp. 12888–12900. PMLR (2022)

  14. [14]

    In: SIGGRAPH Asia 2024 Conference Papers

    Li, X.L., Li, H., Chen, H.X., Mu, T.J., Hu, S.M.: Discene: Object decoupling and interaction modeling for complex scene generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–12 (2024)

  15. [15]

    Li, X., Lai, Z., Xu, L., Qu, Y., Cao, L., Zhang, S., Dai, B., Ji, R.: Director3d: Real-world camera trajectory and 3d scene generation from text (2024)

  16. [16]

    Archives of psychology (1932)

    Likert, R.: A technique for the measurement of attitudes. Archives of psychology (1932)

  17. [17]

    arXiv preprint arXiv:2505.02836 (2025)

    Ling, L., Lin, C.H., Lin, T.Y., Ding, Y., Zeng, Y., Sheng, Y., Ge, Y., Liu, M.Y., Bera, A., Li, Z.: Scenethesis: A language and vision agentic framework for 3d scene generation. arXiv preprint arXiv:2505.02836 (2025)

  18. [18]

    In: Advances in Neural Information Processing Systems (2025)

    Mao, Y., Zhong, J., Fang, C., Zheng, J., Tang, R., Zhu, H., Tan, P., Zhou, Z.: Spatiallm: Training large language models for structured indoor modeling. In: Advances in Neural Information Processing Systems (2025)

  19. [19]

    Communications of the ACM 65(1), 99–106 (2021)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  20. [20]

    In: European Conference on Computer Vision

    Öcal, B.M., Tatarchenko, M., Karaoğlu, S., Gevers, T.: Sceneteller: Language-to-3d scene generation. In: European Conference on Computer Vision. pp. 362–378. Springer (2024)

  21. [21]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  22. [22]

    Advances in Neural Information Processing Systems 38, 125055–125081 (2026)

    Ran, X., Li, Y., Xu, L., Yu, M., Dai, B.: Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning. Advances in Neural Information Processing Systems 38, 125055–125081 (2026)

  23. [23]

    Sun, F.Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., Wu, J.: Layoutvlm: Differentiable optimization of 3d layout via vision-language models (2025)

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Sun, J., Qing, C., Tan, J., Xu, X.: Superpoint transformer for 3d scene instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2393–2401 (2023)

  25. [25]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Tang, J., Nie, Y., Markhasin, L., Dai, A., Thies, J., Nießner, M.: Diffuscene: Denoising diffusion models for generative indoor scene synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20507–20518 (2024)

  26. [26]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  27. [27]

    Team, Q.: Qwen2.5: A party of foundation models (September 2024), https://qwenlm.github.io/blog/qwen2.5/

  28. [28]

    arXiv preprint arXiv:2505.05474 (2025)

    Wen, B., Xie, H., Chen, Z., Hong, F., Liu, Z.: 3d scene generation: A survey. arXiv preprint arXiv:2505.05474 (2025)

  29. [29]

    Wu, W., Fan, L., Liu, L., Wonka, P.: Miqp-based layout design for building interiors 37(2), 511–521 (2018)

  30. [30]

    Yang, Y., Lu, J., Zhao, Z., Luo, Z., Yu, J.J., Sanchez, V., Zheng, F.: Llplace: The 3d indoor scene layout generation and editing via large language model (2024)

  31. [31]

    Yang, Y., Sun, F.Y., Weihs, L., VanderBilt, E., Herrasti, A., Han, W., Wu, J., Haber, N., Krishna, R., Liu, L., et al.: Holodeck: Language guided generation of 3d embodied ai environments (2024)

  32. [32]

    Yu, H., Wang, C., Zhuang, P., Menapace, W., Siarohin, A., Cao, J., Jeni, L., Tulyakov, S., Lee, H.Y.: 4real: Towards photorealistic 4d scene generation via video diffusion models. vol. 37, pp. 45256–45280 (2024)

  33. [33]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yu, H.X., Duan, H., Hur, J., Sargent, K., Rubinstein, M., Freeman, W.T., Cole, F., Sun, D., Snavely, N., Wu, J., et al.: Wonderjourney: Going from anywhere to everywhere. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6658–6667 (2024)

  34. [34]

    ACM Trans

    Yu, L.F., Yeung, S.K., Tang, C.K., Terzopoulos, D., Chan, T.F., Osher, S.J.: Make it home: Automatic optimization of furniture arrangement. ACM Trans. Graph. 30(4), 86 (2011)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19313– 19322 (2022)

  36. [36]

    Zausinger, J., Pennig, L., Kozina, A., Sdahl, S., Sikora, J., Dendorfer, A., Kuznetsov, T., Hagog, M., Wiedemann, N., Chlodny, K., et al.: Regress, don’t guess–a regression-like loss on number tokens for language models (2024)

  37. [37]

    Zhang, Y., Cai, Z., Wang, M., Guo, M., Li, T., Lin, L., Wang, Y.: M3dlayout: A multi-source dataset of 3d indoor layouts and structured descriptions for 3d generation (2026)

  38. [38]

    Zhong, W., Cao, P., Jin, Y., Li, L., Cai, W., Lin, J., Wang, H., Lyu, Z., Wang, T., XU, X., et al.: Internscenes: A large-scale simulatable indoor scene dataset with realistic layouts (2026)

  39. [39]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Zhou, M., Wang, Y., Hou, J., Zhang, S., Li, Y., Luo, C., Peng, J., Zhang, Z.: Scenex: Procedural controllable large-scale scene generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 10806–10814 (2025)

  40. [40]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4295–4305 (2025)

  41. [41]

    Zhu, X., Huang, X., Xie, Q., Deng, Z., Yu, J., Guan, Y., Liu, Z., Zhu, L., Zhao, Q., Liu, L., et al.: Imaginarium: Vision-guided high-quality 3d scene layout generation. ACM Transactions on Graphics (TOG) 44(6), 1–24 (2025)

  42. [42]

    Select items that naturally belong in a {target_scene_type}

  43. [43]

    The total count must be 20

  44. [44]

    str1", "str2

    Output strictly a JSON list of strings [ "str1", "str2" ... ]. Your JSON List: """ Here,{target_scene_type}specifies the room type, and{style_description} is filled with a randomly sampled room style (out of “minimalist”, “cozy and warm”, “messy and cluttered”, “vintage and retro”, “industrial”, “luxurious and expensive”, “Modern and clean”) to increase t...

  45. [45]

    Physical Plausibility

  46. [46]

    Semantic Plausibility

  47. [47]

    Visual Aesthetics **SCORING METHODOLOGY (CRITICAL):** For each main dimension, you must evaluate specific **Sub-criteria**

  48. [48]

    Assign a score from 1 to 5 (Integer) for EACH sub-criterion

  49. [49]

    Calculate the average of these sub-scores to get the Final Dimension Score

  50. [50]

    ""Here are the rendered images of a generated room layout. **Target Room Type:**

    Round the Final Dimension Score to the nearest integer for the JSON output. Scoring Rubric: - 1: Very Poor (Critical failure, completely unusable) - 2: Poor (Major flaws, breaks immersion) - 3: Fair (Acceptable logic, but unrefined) - 4: Good (Functional and pleasing, minor issues) - 5: Excellent (Professional quality, flawless) Output strictly in JSON fo...

  51. [51]

    **Gravity & Support:** Are objects floating in the air? Are heavy objects naturally supported by the floor or other surfaces?

  52. [52]

    **Collision:** Do objects intersect or clip into each other or walls significantly? (Ignore very minor mesh overlaps)

  53. [53]

    **Stability:** Are objects placed in a way that implies they would fall over in real life? --- ### Dimension 2: Semantic Plausibility **Is the layout practical and functional for human use?** *(Sub-criteria)*

  54. [54]

    {room_type}

    **Scene Identity (Layout-based):** Given the fixed set of assets, does this specific **arrangement** successfully convey the function of a "{room_type}"? (e.g., A bathroom layout should look like a bathroom, not a bedroom, based on how items are grouped)

  55. [55]

    **Accessibility / Flow:** Can a human physically walk through the space? Are pathways clear? Are doors, drawers, or critical zones blocked by other objects?

  56. [56]

    * *Examples:* A sofa must face the TV; A toilet must have legroom; A desk chair must face the desk

    **Usability Logic:** **Definition:** The strict functional relationship between interacting objects. * *Examples:* A sofa must face the TV; A toilet must have legroom; A desk chair must face the desk. Is the primary function of the furniture enabled by its orientation and position?

  57. [57]

    natural

    **Everyday Habits:** **Definition:** The "soft" constraints of human behavior and comfort, distinct from strict logic. * *Examples:* Is the nightstand practically placed within reach of the bedhead? Is the coffee table at a comfortable reach distance from the sofa (not too far, not too close)? Does the layout feel "natural" to live in? --- ### Dimension ...

  58. [58]

    visual weight

    **Composition & Spatial Balance:** * *Explanation:* Evaluate the distribution of "visual weight". Does the room feel lopsided? Is there an appropriate use of negative space (empty floor), or is it overcrowded/too sparse?

  59. [59]

    Are objects aligned to implied architectural lines (walls, rugs)? Are rotations clean or chaotically random without purpose? Do edges align pleasingly?

    **Alignment & Grid Logic:** * *Explanation:* Evaluate the geometric order. Are objects aligned to implied architectural lines (walls, rugs)? Are rotations clean or chaotically random without purpose? Do edges align pleasingly?

  60. [60]

    reasoning

    **Arrangement Harmony:** * *Explanation:* Do the objects feel like they belong together in this specific cluster? Is the grouping aesthetically coherent, or does it look like a random pile of assets dumped on the floor? --- ### Output Format Output ONLY valid JSON. **Important:** Inside the "reasoning" text, you must explicitly list the s...

  61. [61]

    Adjacent points in the list must share either the exact same X or the same Y coordinate

    Orthogonal Only: All corners must be right angles. Adjacent points in the list must share either the exact same X or the same Y coordinate. No diagonal lines

  62. [62]

    - 1 cutout requires exactly 6 points (e.g., L-shape)

    Form: Treat the room as a main rectangular Bounding Box with 1 or 2 rectangular "cutouts" missing from its edges or corners. - 1 cutout requires exactly 6 points (e.g., L-shape). - 2 cutouts require exactly 8 points (e.g., Z-shape, T-shape).

  63. [63]

    The area of each individual cutout must be $\le 0.25 \times S$

    Area Limit: Let the area of the main Bounding Box be $S$. The area of each individual cutout must be $\le 0.25 \times S$

  64. [64]

    Scale is in meters

    Tracing: The coordinates must trace the perimeter of the room in a continuous, non-intersecting closed loop (clockwise or counter-clockwise), starting at [0.0, 0.0]. Scale is in meters. Output Format: For each room, briefly state the Bounding Box dimensions, cutout dimensions, and prove the area constraint ($\text{Cutout Area} \le 0.25 \times S$). Then output ...
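
The room-outline rules quoted in the last four entries (orthogonal edges, 6 or 8 points, a start at [0.0, 0.0], and the cutout area limit) lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: `check_room_polygon` and its messages are hypothetical, self-intersection is not tested, and the area rule is checked only as a necessary condition on the total missing area, since per-cutout areas would require decomposing the shape.

```python
def polygon_area(points):
    """Shoelace formula; the result is independent of winding direction."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def check_room_polygon(points):
    """Return a list of rule violations (empty if none were found)."""
    if len(points) not in (6, 8):
        return ["expected exactly 6 or 8 points"]
    problems = []
    if tuple(points[0]) != (0.0, 0.0):
        problems.append("outline must start at [0.0, 0.0]")
    n = len(points)
    for i in range(n):
        (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
        # A valid edge changes exactly one coordinate.
        if (x1 == x2) == (y1 == y2):
            problems.append(f"edge {i} is not axis-aligned")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    missing = bbox_area - polygon_area(points)
    num_cutouts = (n - 4) // 2  # 6 points -> 1 cutout, 8 points -> 2
    if missing > 0.25 * bbox_area * num_cutouts + 1e-9:
        problems.append("cutout area exceeds 0.25 x bounding-box area")
    return problems

# Assumed examples: a valid L-shape, an oversized cutout, a diagonal edge.
ok_room = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (3.0, 3.0), (3.0, 4.0), (0.0, 4.0)]
big_cut = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (1.0, 1.0), (1.0, 4.0), (0.0, 4.0)]
diagonal = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (3.0, 4.0), (0.0, 4.0), (0.0, 2.0)]
```

Calling `check_room_polygon(ok_room)` yields no violations, while the other two outlines each trigger one rule.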