pith. machine review for the scientific record.

arxiv: 2604.19907 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene synthesis · agentic frameworks · tool-call trajectories · orchestrator · discriminator · LLM agents · 3D generation · trajectory optimization

The pith

SceneOrchestra trains an orchestrator to generate complete tool-call trajectories for 3D scene synthesis without any intermediate rendering or review steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the limitations of current agentic 3D scene synthesis methods that rely on execute-review-reflect loops driven by heuristics and slowed by per-step rendering. It introduces SceneOrchestra, which trains an orchestrator to output full sequences of tool calls and parameters from an instruction, paired with a discriminator that evaluates entire trajectories during training. A two-phase process first teaches trajectory generation and quality assessment, then uses interleaved training so the discriminator adapts to the orchestrator's outputs and improves it. At inference time only the orchestrator runs, producing and executing the full trajectory directly. Experiments indicate this yields state-of-the-art scene quality while cutting runtime by removing repeated rendering and suboptimal step decisions.
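
To make the shape of that design concrete, here is a minimal sketch of single-pass trajectory execution; `ToolCall`, `orchestrator.generate`, and the `tools` registry are hypothetical names invented for illustration, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str     # hypothetical tool name, e.g. "place_object" or "set_lighting"
    params: dict  # tool-specific parameters chosen by the orchestrator

def run_full_trajectory(orchestrator, instruction: str, tools: dict):
    """Single-pass inference as the pith describes it: generate the whole
    tool-call plan up front, then execute it with no render/review in between."""
    trajectory = orchestrator.generate(instruction)  # full list of ToolCall
    scene = None
    for call in trajectory:
        scene = tools[call.tool](scene, **call.params)  # execute in order
    return scene  # rendered once at the end, not after every step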

Core claim

SceneOrchestra consists of an orchestrator and a discriminator fine-tuned with a two-phase training strategy. In the first phase the orchestrator learns context-aware tool selection and complete tool-call trajectory generation while the discriminator is trained to assess the quality of full trajectories. In the second phase interleaved training lets the discriminator adapt to the orchestrator's evolving distribution and distill its discriminative capability back into the orchestrator. At inference the orchestrator alone generates and executes full tool-call trajectories from instructions without requiring the discriminator or any intermediate rendering.
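
Read as pseudocode, the two phases the claim describes might be organized as follows; every class, method, and update rule here is an assumed stand-in, since the review names the stages but not the implementation.

```python
def phase1_independent(orchestrator, discriminator, sft_data, preference_pairs):
    # Stage A: orchestrator imitates complete tool-call trajectories
    # collected from an existing heuristic pipeline.
    for instruction, trajectory in sft_data:
        orchestrator.sft_step(instruction, trajectory)
    # Stage B: discriminator learns to rank whole trajectories by quality.
    for instruction, better, worse in preference_pairs:
        discriminator.ranking_step(instruction, better, worse)

def phase2_interleaved(orchestrator, discriminator, instructions, n=4):
    for instruction in instructions:
        # Sample several candidate trajectories from the current orchestrator.
        candidates = [orchestrator.generate(instruction) for _ in range(n)]
        scores = [discriminator.score(instruction, t) for t in candidates]
        best = candidates[scores.index(max(scores))]
        # The discriminator adapts to the orchestrator's evolving distribution...
        discriminator.adapt(instruction, candidates)
        # ...and its preference is distilled back, e.g. as DPO-style updates
        # with the top-scored trajectory as "chosen".
        for rejected in (t for t in candidates if t is not best):
            orchestrator.preference_step(instruction, chosen=best, rejected=rejected)
```

At inference only `orchestrator.generate` would run, which is what removes the discriminator and the per-step renders from the deployed pipeline.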

What carries the argument

The orchestrator that produces complete tool-call trajectories in one pass, paired with a discriminator that selects the best trajectory from candidates during training.

If this is right

  • State-of-the-art scene quality is achieved compared with prior agentic methods.
  • Runtime decreases because intermediate rendering and review steps are eliminated.
  • Tool selection and parameter choices improve over heuristic rules.
  • Inference requires only the orchestrator and runs without the discriminator.
  • Full trajectories avoid the latency accumulation of step-by-step execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The full-trajectory design may reduce compounding errors that arise when each step depends on visual feedback from the previous render.
  • This training pattern could transfer to other agentic domains that currently use per-step review, such as sequential code editing or robotic manipulation planning.
  • Interleaved training between generator and selector might scale to larger models and more complex multi-room scenes if more trajectory data becomes available.

Load-bearing premise

An orchestrator trained on full trajectories can reliably produce high-quality 3D scenes without any intermediate rendering or review steps, and the discriminator can consistently identify superior trajectories from the orchestrator's outputs.

What would settle it

A direct comparison on the same 3D scene benchmarks against an ablation that re-introduces intermediate rendering and review: the claim fails if SceneOrchestra's scenes score lower than the rendering-based variant, or if dropping the per-step renders does not actually reduce runtime.
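
A minimal harness for that comparison could look like the sketch below, assuming both pipeline variants and a quality scorer already exist as callables; all names are hypothetical.

```python
import time

def compare_variants(instructions, full_pipeline, loop_pipeline, quality_score):
    """Run both variants on identical instructions, recording quality and
    wall-clock time. `full_pipeline` executes a complete trajectory in one
    pass; `loop_pipeline` re-introduces per-step rendering and review."""
    results = []
    for instruction in instructions:
        for name, pipeline in (("full", full_pipeline), ("loop", loop_pipeline)):
            start = time.perf_counter()
            scene = pipeline(instruction)
            elapsed = time.perf_counter() - start
            results.append({"variant": name, "instruction": instruction,
                            "quality": quality_score(scene), "seconds": elapsed})
    return results
```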

Figures

Figures reproduced from arXiv: 2604.19907 by Kelin Yu, Matthias Zwicker, Yun He.

Figure 1
Figure 1. SceneOrchestra is a trainable orchestration framework that outputs full tool-call trajectories to generate 3D scenes, optimized for efficiency and quality. Building on initial tool-call trajectories generated using an existing method, we train SceneOrchestra in two phases leveraging supervised fine-tuning (SFT), direct preference optimization (DPO), and model distillation techniques. At inference time, Sce… view at source ↗
Figure 2
Figure 2. Training Paradigm. Our training consists of two phases. In the independent training phase, we train the orchestrator and discriminator separately to develop their initial capabilities. The orchestrator undergoes four stages: 1) stepwise SFT, 2) trajectory-level SFT, 3) stepwise DPO, and 4) trajectory-level DPO, learning context-aware tool selection and complete trajectory generation. Meanwhile, the discri… view at source ↗
Figure 3
Figure 3. Qualitative comparison with baselines on both seen (bathroom, meeting room) and unseen (bedroom, gym) room types. Our method achieves higher visual realism, more plausible layouts, and richer details. view at source ↗
Figure 4
Figure 4. Qualitative comparison on complex instructions. view at source ↗
Figure 5
Figure 5. Extra qualitative comparison with baselines. view at source ↗
Figure 6
Figure 6. Extra qualitative results on complex instructions. view at source ↗
Figure 7
Figure 7. Example of user study. view at source ↗
read the original abstract

Recent agentic frameworks for 3D scene synthesis have advanced realism and diversity by integrating heterogeneous generation and editing tools. These tools are organized into workflows orchestrated by an off-the-shelf LLM. Current approaches typically adopt an execute-review-reflect loop: at each step, the orchestrator executes a tool, renders intermediate results for review, and then decides on the tool and its parameters for the next step. However, this design has two key limitations. First, next-step tool selection and parameter configuration are driven by heuristic rules, which can lead to suboptimal execution flows, unnecessary tool invocations, degraded output quality, and increased runtime. Second, rendering and reviewing intermediate results after each step introduces additional latency. To address these issues, we propose SceneOrchestra, a trainable orchestration framework that optimizes the tool-call execution flow and eliminates the step-by-step review loop, improving both efficiency and output quality. SceneOrchestra consists of an orchestrator and a discriminator, which we fine-tune with a two-phase training strategy. In the first phase, the orchestrator learns context-aware tool selection and complete tool-call trajectory generation, while the discriminator is trained to assess the quality of full trajectories, enabling it to select the best trajectory from multiple candidates. In the second phase, we perform interleaved training, where the discriminator adapts to the orchestrator's evolving trajectory distribution and distills its discriminative capability back into the orchestrator. At inference, we only use the orchestrator to generate and execute full tool-call trajectories from instructions, without requiring the discriminator. Extensive experiments show that our method achieves state-of-the-art scene quality while reducing runtime compared to previous work.
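
For contrast, the execute-review-reflect loop the abstract criticizes reduces to a schematic like this; `decide_next_step`, `review`, and `render` are hypothetical placeholders, and the point is the render-and-review cost paid on every step.

```python
def execute_review_reflect(llm, instruction, tools, render, max_steps=20):
    """The baseline loop the abstract criticizes, in schematic form.
    `llm.decide_next_step` and `llm.review` stand in for heuristic,
    off-the-shelf LLM decisions conditioned on intermediate renders."""
    scene, history = None, []
    for _ in range(max_steps):
        call = llm.decide_next_step(instruction, history)  # heuristic choice
        if call is None:           # the loop decides it is done
            break
        scene = tools[call.tool](scene, **call.params)
        image = render(scene)      # extra latency on every step
        history.append((call, llm.review(image)))
    return scene
```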

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SceneOrchestra, a trainable agentic framework for 3D scene synthesis consisting of an orchestrator that generates complete tool-call trajectories and a discriminator that selects the highest-quality trajectory among candidates. It replaces the standard execute-review-reflect loop (with per-step rendering and heuristic next-step decisions) by a two-phase training procedure: phase 1 trains the orchestrator on full trajectories and the discriminator on trajectory quality, while phase 2 interleaves updates so the discriminator adapts to the orchestrator's distribution and distills feedback back. At inference only the orchestrator is used, with no intermediate renders. The central claim is that this yields state-of-the-art scene quality together with substantially lower runtime.

Significance. If the quantitative claims hold, the work would demonstrate that full-trajectory orchestration can eliminate costly per-step rendering while still producing higher-quality 3D scenes than heuristic execute-review-reflect baselines. The two-phase interleaved training procedure is a concrete, reproducible mechanism for aligning the discriminator with the orchestrator's evolving distribution; if the experiments include ablations on trajectory length, tool-order sensitivity, and discriminator selection accuracy, this would constitute a clear methodological advance for agentic tool-use pipelines.

major comments (2)
  1. [Abstract] The claim that the method 'achieves state-of-the-art scene quality while reducing runtime' is presented without any numerical values, baseline names, dataset sizes, or metric definitions. Because the central contribution is an empirical improvement over execute-review-reflect loops, the absence of these numbers in the abstract (and the lack of any table or figure reference) makes the claim impossible to evaluate from the provided text.
  2. [Method (two-phase training description)] The weakest assumption identified in the stress-test note is load-bearing: the orchestrator is trained only on static full trajectories collected from prior heuristics, yet 3D placement, scaling and lighting tools are order-dependent and collision/occlusion errors often require visual correction. No ablation is described that tests whether trajectories generated without any intermediate render or review step actually avoid these errors at inference time, nor is there a comparison of discriminator accuracy on held-out orchestrator-generated trajectories versus the training distribution.
minor comments (1)
  1. [Abstract] The abstract introduces the terms 'orchestrator' and 'discriminator' without a brief parenthetical definition or forward reference to the section where their architectures are specified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the significance of our work and for the constructive major comments. We address each point below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim that the method 'achieves state-of-the-art scene quality while reducing runtime' is presented without any numerical values, baseline names, dataset sizes, or metric definitions. Because the central contribution is an empirical improvement over execute-review-reflect loops, the absence of these numbers in the abstract (and the lack of any table or figure reference) makes the claim impossible to evaluate from the provided text.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support for the central claims. In the revised manuscript we will update the abstract to report key numerical results (scene quality metrics and runtime reductions relative to the execute-review-reflect baselines), name the primary baselines, specify the evaluation datasets, define the metrics, and add explicit references to the relevant tables and figures. revision: yes

  2. Referee: [Method (two-phase training description)] The weakest assumption identified in the stress-test note is load-bearing: the orchestrator is trained only on static full trajectories collected from prior heuristics, yet 3D placement, scaling and lighting tools are order-dependent and collision/occlusion errors often require visual correction. No ablation is described that tests whether trajectories generated without any intermediate render or review step actually avoid these errors at inference time, nor is there a comparison of discriminator accuracy on held-out orchestrator-generated trajectories versus the training distribution.

    Authors: We acknowledge that the manuscript does not currently contain the requested ablations on tool-order sensitivity, error avoidance without intermediate rendering, or discriminator accuracy on held-out orchestrator-generated trajectories. While the two-phase interleaved training is intended to adapt the discriminator to the orchestrator's distribution and the final experimental results show that the learned full trajectories produce higher-quality scenes than the baselines, we agree that explicit validation of these points would strengthen the claims. In the revision we will add the suggested ablation studies, including a direct comparison of discriminator performance on held-out orchestrator trajectories versus the original training distribution. revision: yes
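
The promised check could be scored as simple selection accuracy: how often the discriminator's top pick coincides with an external quality oracle's, measured separately on the training distribution and on held-out orchestrator generations. A sketch under those assumptions (all names hypothetical):

```python
def selection_accuracy(discriminator, eval_set):
    """Fraction of instructions where the discriminator's top-scored candidate
    matches the best candidate under an external quality oracle. `eval_set`
    yields (instruction, candidates, oracle_scores) triples; run it once with
    candidates from the heuristic training distribution and once with held-out
    orchestrator generations, then compare the two numbers."""
    hits, total = 0, 0
    for instruction, candidates, oracle_scores in eval_set:
        scores = [discriminator.score(instruction, t) for t in candidates]
        hits += int(scores.index(max(scores)) == oracle_scores.index(max(oracle_scores)))
        total += 1
    return hits / total
```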

Circularity Check

0 steps flagged

No circularity: empirical two-phase training procedure with external evaluation

full rationale

The paper presents an empirical ML training pipeline (orchestrator fine-tuned on full trajectories in phase 1, interleaved discriminator adaptation in phase 2) whose success is measured by downstream scene-quality metrics and runtime on held-out instructions. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The central claim reduces to standard supervised fine-tuning plus selection, which is independently falsifiable against baselines and does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that LLMs can be fine-tuned to produce coherent multi-step tool sequences for 3D tasks and introduces two new trainable modules whose effectiveness is asserted but not independently validated in the provided abstract.

axioms (1)
  • domain assumption LLMs can be fine-tuned to generate coherent full tool-call sequences for 3D scene tasks without intermediate feedback
    This is the central premise that allows removal of the execute-review-reflect loop.
invented entities (2)
  • Orchestrator no independent evidence
    purpose: Generates complete tool-call trajectories from instructions
    New trainable component introduced to replace step-wise decision making.
  • Discriminator no independent evidence
    purpose: Scores quality of full trajectories to guide training
    New trainable component used only during training to select best plans.

pith-pipeline@v0.9.0 · 5600 in / 1402 out tokens · 38429 ms · 2026-05-10T03:19:42.057780+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  3. [3]

    I-Design: Personalized LLM Interior Designer

    Çelen, A., Han, G., Schindler, K., Van Gool, L., Armeni, I., Obukhov, A., Wang, X.: I-design: Personalized llm interior designer. In: European Conference on Computer Vision. pp. 217–234. Springer (2024)

  4. [4]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Deitke, M., VanderBilt, E., Herrasti, A., Weihs, L., Ehsani, K., Salvador, J., Han, W., Kolve, E., Kembhavi, A., Mottaghi, R.: Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems 35, 5982–5994 (2022)

  5. [5]

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Feng, W., Zhu, W., Fu, T.j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: Layoutgpt: Compositional visual planning and generation with large language models. Advances in neural information processing systems 36, 18225–18250 (2023)

  6. [6]

    AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

    Fu, R., Wen, Z., Liu, Z., Sridhar, S.: Anyhome: Open-vocabulary generation of structured and textured 3d homes. In: European Conference on Computer Vision. pp. 52–70. Springer (2024)

  7. [7]

    Off-Policy Deep Reinforcement Learning without Exploration

    Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International conference on machine learning. pp. 2052–2062. PMLR (2019)

  8. [8]

    Generative Adversarial Networks

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

  9. [9]

    ArtiScene: Language-Driven Artistic 3D Scene Generation through Image Intermediary

    Gu, Z., Cui, Y., Li, Z., Wei, F., Ge, Y., Gu, J., Liu, M.Y., Davis, A., Ding, Y.: Artiscene: Language-driven artistic 3d scene generation through image intermediary. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2891–2901 (2025)

  10. [10]

    Mastering Atari with Discrete World Models

    Hafner, D., Lillicrap, T., Norouzi, M., Ba, J.: Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193 (2020)

  11. [11]

    LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents

    He, Y., Pittaluga, F., Jiang, Z., Zwicker, M., Chandraker, M., Tasneem, Z.: Langdrivectrl: Natural language controllable driving scene editing with multi-modal agents. arXiv preprint arXiv:2512.17445 (2025)

  12. [12]

    Density-Preserving Deep Point Cloud Compression

    He, Y., Ren, X., Tang, D., Zhang, Y., Xue, X., Fu, Y.: Density-preserving deep point cloud compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2333–2342 (2022)

  13. [13]

    Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent with Learned Distance Functions

    He, Y., Tang, D., Zhang, Y., Xue, X., Fu, Y.: Grad-pu: Arbitrary-scale point cloud upsampling via gradient descent with learned distance functions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5354–5363 (2023)

  14. [14]

    Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

    Höllein, L., Cao, A., Owens, A., Johnson, J., Nießner, M.: Text2room: Extracting textured 3d meshes from 2d text-to-image models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7909–7920 (2023)

  15. [15]

    SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

    Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: Scenecraft: An llm agent for synthesizing 3d scenes as blender code. In: Forty-first International Conference on Machine Learning (2024)

  16. [16]

    VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

    Jiang, D., Lu, Y., Li, Z., Lyu, Z., Nie, P., Wang, H., Su, A., Chen, H., Zou, K., Du, C., Pang, T., Chen, W.: Verltool: Towards holistic agentic reinforcement learning with tool use (2025), https://arxiv.org/abs/2509.01055

  17. [17]

    Offline Reinforcement Learning with Implicit Q-Learning

    Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169 (2021)

  18. [18]

    Conservative Q-Learning for Offline Reinforcement Learning

    Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33, 1179–1191 (2020)

  19. [19]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Levine, S., Kumar, A., Tucker, G., Fu, J.: Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)

  20. [20]

    LEAGUE++: Empowering Continual Robot Learning through Guided Skill Acquisition with Large Language Models

    Li, Z., Yu, K., Cheng, S., Xu, D.: League++: Empowering continual robot learning through guided skill acquisition with large language models. In: ICLR 2024 Workshop on Large Language Model (LLM) Agents (2024), https://openreview.net/forum?id=xXo4JL8FvV

  21. [21]

    Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

    Ling, L., Lin, C.H., Lin, T.Y., Ding, Y., Zeng, Y., Sheng, Y., Ge, Y., Liu, M.Y., Bera, A., Li, Z.: Scenethesis: A language and vision agentic framework for 3d scene generation. arXiv preprint arXiv:2505.02836 (2025)

  22. [22]

    Large Language Models: A Survey

    Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., Gao, J.: Large language models: A survey. arXiv preprint arXiv:2402.06196 (2024)

  23. [23]

    OpenAI API Reference

    OpenAI: OpenAI API reference. https://platform.openai.com/docs/api-reference (2024), accessed: 2025-11-08

  24. [24]

    ATISS: Autoregressive Transformers for Indoor Scene Synthesis

    Paschalidou, D., Kar, A., Shugrina, M., Kreis, K., Geiger, A., Fidler, S.: Atiss: Autoregressive transformers for indoor scene synthesis. Advances in neural information processing systems 34, 12013–12026 (2021)

  25. [25]

    Gorilla: Large Language Model Connected with Massive APIs

    Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large language model connected with massive apis (2023), https://arxiv.org/abs/2305.15334

  26. [26]

    SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

    Pfaff, N., Cohn, T., Zakharov, S., Cory, R., Tedrake, R.: Scenesmith: Agentic generation of simulation-ready indoor scenes. arXiv preprint arXiv:2602.09153 (2026)

  27. [27]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)

  28. [28]

    Infinigen Indoors: Photorealistic Indoor Scenes Using Procedural Generation

    Raistrick, A., Mei, L., Kayan, K., Yan, D., Zuo, Y., Han, B., Wen, H., Parakh, M., Alexandropoulos, S., Lipson, L., et al.: Infinigen indoors: Photorealistic indoor scenes using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21783–21794 (2024)

  29. [29]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)

  30. [30]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  32. [32]

    LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

    Sun, F.Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., Wu, J.: Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29469–29478 (2025)

  33. [33]

    DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

    Tang, J., Nie, Y., Markhasin, L., Dai, A., Thies, J., Nießner, M.: Diffuscene: Denoising diffusion models for generative indoor scene synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20507–20518 (2024)

  34. [34]

    3D Scene Generation: A Survey

    Wen, B., Xie, H., Chen, Z., Hong, F., Liu, Z.: 3d scene generation: A survey. arXiv preprint arXiv:2505.05474 (2025)

  35. [35]

    Sage: Scalable Agentic 3D Scene Generation for Embodied AI

    Xia, H., Li, X., Li, Z., Ma, Q., Xu, J., Liu, M.Y., Cui, Y., Lin, T.Y., Ma, W.C., Wang, S., et al.: Sage: Scalable agentic 3d scene generation for embodied ai. arXiv preprint arXiv:2602.10116 (2026)

  36. [36]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  37. [37]

    SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

    Yang, Y., Jia, B., Zhang, S., Huang, S.: Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent. arXiv preprint arXiv:2509.20414 (2025)

  38. [38]

    PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

    Yang, Y., Jia, B., Zhi, P., Huang, S.: Physcene: Physically interactable 3d scene synthesis for embodied ai. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16262–16272 (2024)

  39. [39]

    Holodeck: Language Guided Generation of 3D Embodied AI Environments

    Yang, Y., Sun, F.Y., Weihs, L., VanderBilt, E., Herrasti, A., Han, W., Wu, J., Haber, N., Krishna, R., Liu, L., et al.: Holodeck: Language guided generation of 3d embodied ai environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16227–16237 (2024)

  40. [40]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: Synergizing reasoning and acting in language models (2023), https://arxiv.org/abs/2210.03629

  41. [41]

    Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

    Yin, S., Ge, J., Wang, Z.Z., Li, X., Black, M.J., Darrell, T., Kanazawa, A., Feng, H.: Vision-as-inverse-graphics agent via interleaved multimodal reasoning. arXiv preprint arXiv:2601.11109 (2026)

  42. [42]

    The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y., Zhou, Y., Chen, Y., Zhang, C., Fan, Y., Wang, Z., Huang, S., Piedrahita-Velez, F., Liao, Y., Wang, H., Yang, M., Ji, H., Wang, J., Yan, S., Torr, P., Bai, L.: The landscape of agentic reinforcement learning for llms: A survey (2026), https://arxiv.org/abs/2509.02547

  43. [43]

    A Survey of 3D Indoor Scene Synthesis

    Zhang, S.H., Zhang, S.K., Liang, Y., Hall, P.: A survey of 3d indoor scene synthesis. Journal of Computer Science and Technology 34(3), 594–608 (2019)

  44. [44]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z.: Llamafactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations). pp. 400–410 (2024)