Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

Junwei Zhou; Yu-Wing Tai

arxiv: 2605.25326 · v1 · pith:GUXV6W7Xnew · submitted 2026-05-25 · 💻 cs.CV

Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

Junwei Zhou , Yu-Wing Tai This is my paper

Pith reviewed 2026-06-29 23:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular 3D scene layoutlayout-as-policyperceive-then-planvision-language modelspolicy learning3D scene understandingiterative refinement

0 comments

The pith

Monocular 3D layout estimation improves by using a perceiver to ground objects and a planner to iteratively refine via corrective actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates monocular 3D scene layout estimation as a perceive-then-plan problem. A Perceiver first grounds 3D objects from the image using vision-language models. A Planner then treats the layout as a state and applies discrete actions like translation, rotation, and rescaling to improve physical plausibility while keeping consistency with the image. The planner is trained with supervised trajectory initialization combined with preference-based optimization. This turns the task into an iterative refinement process that handles global constraints better.

Core claim

By casting the planning stage as a policy learning problem where 3D layouts are structured states refined via discrete actions, starting from an observation-aligned initialization, the model produces layouts that are more physically coherent and better aligned with visual observations.

What carries the argument

Layout-as-Policy (LaP), which represents 3D layouts as structured states refined through discrete actions in an iterative planner trained with supervised initialization and preference optimization.

If this is right

Layouts are more physically coherent than those from direct prediction.
Outputs align better with the input image observations.
The approach naturally supports downstream tasks such as scene editing and manipulation.
The task shifts from one-shot prediction to iterative refinement for handling complex interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could reduce reliance on full 3D supervision during training by using preference optimization.
It may generalize to other structured prediction tasks in vision that require enforcing physical constraints.
Applying the planner to video inputs could enable temporally consistent scene layouts.

Load-bearing premise

The planner learns reliable corrective action sequences from supervised trajectories and preference optimization without explicit rewards or ground-truth 3D layouts.

What would settle it

A direct comparison on a benchmark dataset showing whether the iterative planner produces lower physical inconsistency scores than a non-iterative baseline on images with known geometric conflicts.

Figures

Figures reproduced from arXiv: 2605.25326 by Junwei Zhou, Yu-Wing Tai.

**Figure 2.** Figure 2: Overview of our perceive-then-plan framework. The Perceiver grounds 3D boxes from the input image. A canonicalized, grid-based representation is then iteratively refined by the LaP Planner, which treats each layout as a structured state and selects discrete actions (translate, rotate, rescale) via a learned policy until convergence. The final layout supports both scene assembly (digital twin) and camera-sp… view at source ↗

**Figure 3.** Figure 3: A simple illustration of DPO data pair synthesis. We note that the action sequences can support multiple objects. Action Space. We design a compact action space that captures the minimal yet sufficient set of operations for layout refinement. Each action is formulated in a programming language format with explicit parameters: SELECT(i) Selecting the object i in the structured layout L3d = {li}M i=1 for la… view at source ↗

**Figure 4.** Figure 4: We visualize the 3D grounding result generated by our Perceiver 8B model. For better [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Example of our LaP guided action refinement process. We visualize each iteration’s [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: More visualization of final 3D layout and reprojection. We back convert the structured [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: We show examples of our assembled scene with the estimated 3D layout. We compare with [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Scene editing with instructions. We ablate the key design choices of both components, with a particular focus on the LaP Planner to isolate the contribution of each element in our policy learning pipeline. For the Perceiver, we study the local-axis representation and geometry modulation in Tab. 3. Removing geometry modulation yields the largest drop across all metrics, confirming that fusing visual fea… view at source ↗

**Figure 9.** Figure 9: Another downstream application of instructional refinement using user input language. After layout refinement by the LaP Planner, we populate the predicted 3D boxes with actual 3D assets (retrieved from existing datasets or generated by 3D generative models, we generate with SAM 3D in the main paper for better visualization) and apply a gravity simulation step to resolve residual placement artifacts cau… view at source ↗

**Figure 10.** Figure 10: We present more visualization results of our Perceiver on real-world captures. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: We present more visualization results of our LaP Planner. [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

read the original abstract

Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LaP turns layout estimation into iterative action-based planning but the abstract gives no evidence the preference training actually enforces geometric fixes.

read the letter

The paper's main idea is to split monocular 3D layout estimation into a Perceiver that grounds objects from the image and a Planner that treats the layout as a state and fixes it through discrete actions like translate, rotate, or rescale. They train the planner first with supervised trajectories then with preference optimization to learn corrective sequences without hand-crafted rewards.

This framing moves away from pure one-shot prediction and tries to handle global spatial constraints through iteration, which is a sensible direction for improving physical coherence and object relations. The downstream support for scene editing follows naturally from the action-based representation.

The soft spot is exactly the one flagged in the stress test. After the initial supervised pass the planner refines without ground-truth 3D access, so the preference signal has to come from somewhere else, probably image-based or indirect. Nothing in the abstract shows that this signal is strong enough to produce valid geometric corrections rather than just staying near the initialization distribution. The claim of better physical coherence and visual alignment is stated but unsupported by any numbers, baselines, or ablations.

This is for people already working on VLM-driven scene understanding or structured prediction in robotics and AR. It might be worth a referee if the full experiments and comparisons are there, but the current text does not let you judge whether the central training claim holds.

Referee Report

2 major / 1 minor

Summary. The paper formulates monocular 3D scene layout estimation as a perceive-then-plan task using vision-language models. A Perceiver first grounds 3D objects from the input image; a Planner then represents layouts as states and iteratively refines them via discrete actions (translate/rotate/rescale) to improve physical plausibility while preserving image consistency. The Planner is trained by supervised trajectory initialization followed by preference-based optimization, converting the task from one-shot prediction to iterative refinement.

Significance. If the empirical claims hold, the work offers a principled shift from direct regression to policy-driven refinement, potentially improving global constraint satisfaction and enabling downstream applications such as scene editing. The combination of supervised initialization with preference optimization without explicit reward engineering is a distinctive technical choice.

major comments (2)

[Planner training description] Planner training section: the description states that preference-based optimization enables the model to learn corrective behaviors without explicit reward engineering or ground-truth 3D layouts at refinement time, yet supplies no mechanism (e.g., preference data construction, loss formulation, or verification that actions reduce geometric inconsistency) showing how the learned policy discovers physically valid fixes rather than patterns from the initialization distribution. This directly bears on the central “perceive-then-plan” advantage.
[Experiments] Experiments section: the abstract asserts that the approach produces layouts that are more physically coherent and better aligned with visual observations, but the manuscript text provides neither quantitative metrics, baselines, dataset details, nor ablation results, leaving the central empirical claim unevaluable.

minor comments (1)

[Abstract] Abstract: key quantitative results supporting the coherence and alignment claims should be included to allow immediate assessment of the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify two areas where additional detail will strengthen the manuscript. We address each point below and commit to a major revision that incorporates the requested clarifications and results.

read point-by-point responses

Referee: [Planner training description] Planner training section: the description states that preference-based optimization enables the model to learn corrective behaviors without explicit reward engineering or ground-truth 3D layouts at refinement time, yet supplies no mechanism (e.g., preference data construction, loss formulation, or verification that actions reduce geometric inconsistency) showing how the learned policy discovers physically valid fixes rather than patterns from the initialization distribution. This directly bears on the central “perceive-then-plan” advantage.

Authors: We agree that the current description is insufficiently detailed on these points. In the revised manuscript we will add: (i) the exact procedure used to construct preference pairs from the supervised initialization trajectories, (ii) the preference-optimization loss formulation (including how positive and negative actions are contrasted), and (iii) a verification experiment that quantifies the reduction in geometric inconsistency metrics after each refinement step. These additions will make explicit how the policy learns corrective actions that go beyond the initialization distribution. revision: yes
Referee: [Experiments] Experiments section: the abstract asserts that the approach produces layouts that are more physically coherent and better aligned with visual observations, but the manuscript text provides neither quantitative metrics, baselines, dataset details, nor ablation results, leaving the central empirical claim unevaluable.

Authors: We acknowledge the gap. Although an Experiments section exists, it currently lacks the quantitative support needed to evaluate the claims. In the revision we will insert a complete experimental subsection containing: quantitative metrics for physical coherence and image alignment, comparisons against relevant baselines, full dataset descriptions, and ablation studies isolating the contribution of the Planner. Tables and figures will be added to make all claims directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: formulation relies on external training data and standard optimization without self-referential reduction

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivations. The LaP Planner is trained via supervised trajectory initialization plus preference-based optimization on external data, with no indication that any 'prediction' or result is defined in terms of itself or reduces to the input by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The perceive-then-plan framing is a modeling choice supported by the training procedure rather than a tautology. This matches the default case of a self-contained empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly assumes vision-language models can ground 3D objects accurately and that preference optimization suffices for policy learning.

pith-pipeline@v0.9.1-grok · 5766 in / 1136 out tokens · 29515 ms · 2026-06-29T23:09:49.634249+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 30 canonical work pages · 14 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024

Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024

2024
[5]

Physx-3d: Physical-grounded 3d asset generation

Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-3d: Physical-grounded 3d asset generation. arXiv preprint arXiv:2507.12465, 2025

work page arXiv 2025
[6]

I-design: Personalized llm interior designer

Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. InEuropean Conference on Computer Vision (ECCV), pages 217–234. Springer, 2024

2024
[7]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024

2024
[8]

Control3d: Towards controllable text-to-3d generation

Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, and Tao Mei. Control3d: Towards controllable text-to-3d generation. InProceedings of the 31st ACM International Conference on Multimedia, pages 1148–1156, 2023

2023
[9]

Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance. InEuropean Conference on Computer Vision (ECCV), pages 128–146. Springer, 2024

2024
[10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InIEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 24185–24198, 2024

2024
[11]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[12]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems (NeurIPS), 36:18225–18250, 2023

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems (NeurIPS), 36:18225–18250, 2023

2023
[14]

Compgs: Unleashing 2d compositionality for compositional text-to-3d via dynamically optimizing 3d gaussians

Chongjian Ge, Chenfeng Xu, Yuanfeng Ji, Chensheng Peng, Masayoshi Tomizuka, Ping Luo, Mingyu Ding, Varun Jampani, and Wei Zhan. Compgs: Unleashing 2d compositionality for compositional text-to-3d via dynamically optimizing 3d gaussians. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, pages 18509–18520, 2025

2025
[15]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Gemma 3 Technical Report

Gemma Team, Google DeepMind. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Spatial reasoning with vision-language models in ego-centric multi-view scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. InInternational Conference on Learning Representations (ICLR), 2026

2026
[18]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team, Zhipu AI & Tsinghua University. GLM-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Artiscene: Language-driven artistic 3d scene generation through image intermediary

Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, and Yifan Ding. Artiscene: Language-driven artistic 3d scene generation through image intermediary. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference (CVPR), pages 2891–2901, 2025

2025
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. InAdvances in Neural Information Processing Systems (NeurIPS), pages 20482–20494, 2023

2023
[22]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022

2022
[23]

Ross, Cordelia Schmid, and Alireza Fathi

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An llm agent for synthesizing 3d scene as blender code.arXiv preprint arXiv:2403.01248, 2024

work page arXiv 2024
[24]

Holistic 3d scene parsing and reconstruction from a single rgb image

Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic 3d scene parsing and reconstruction from a single rgb image. InEuropean Conference on Computer Vision (ECCV), 2018

2018
[25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Junwei Liang. Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, pages 3707–3717, 2025

2025
[29]

Advances in 3d generation: A survey.arXiv preprint arXiv:2401.17807, 2024

Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, and Ying Shan. Advances in 3d generation: A survey.arXiv preprint arXiv:2401.17807, 2024

work page arXiv 2024
[30]

arXiv preprint arXiv:2505.02836 (2025)

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation.arXiv preprint arXiv:2505.02836, 2025

work page arXiv 2025
[31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[32]

Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

Xinhang Liu, Chi-Keung Tang, and Yu-Wing Tai. Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

work page arXiv 2025
[33]

The Llama 3 Herd of Models

Llama Team, AI @ Meta. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Scenegen: Single-image 3d scene generation in one feedforward pass

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass. InInternational Conference on 3D Vision 2026, 2026

2026
[35]

To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image

Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jian Jun Zhang. To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020
[36]

Courville

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI, 2018

2018
[37]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in neural information processing systems (NeurIPS), pages 53728–53741, 2023. 12

2023
[38]

Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning.arXiv preprint arXiv:2506.05341, 2025

Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, and Bo Dai. Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning.arXiv preprint arXiv:2506.05341, 2025

work page arXiv 2025
[39]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10912–10922, 2021

2021
[40]

SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Meta Superintelligence Labs. SAM 3D: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Sun rgb-d: A rgb-d scene understanding benchmark suite

Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, 2015

2015
[42]

Layoutvlm: Differentiable optimization of 3d layout via vision-language models.arXiv preprint arXiv:2412.02193, 2024

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models.arXiv preprint arXiv:2412.02193, 2024

work page arXiv 2024
[43]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction (second edition). MIT Press, 2018

2018
[44]

A survey on post-training of large language models.arXiv preprint arXiv:2503.06072, 2025

Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, et al. A survey on post-training of large language models.arXiv preprint arXiv:2503.06072, 2025

work page arXiv 2025
[45]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[46]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference (CVPR), pages 5294–5306, 2025

2025
[47]

Holistic 3d scene understanding from a single geo- tagged image

Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Holistic 3d scene understanding from a single geo- tagged image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015

2015
[48]

N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models

Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, and Dong Yu. N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561, 2025

work page arXiv 2025
[49]

Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image.arXiv preprint arXiv:2512.01204, 2025

Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, and Hu Su. Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image.arXiv preprint arXiv:2512.01204, 2025

work page arXiv 2025
[50]

Amodal3r: Amodal 3d reconstruction from occluded 2d images

Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 9181–9193, 2025

2025
[51]

Llava-cot: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 2087–2098, 2025

2087
[52]

Vlm-grounder: A vlm agent for zero-shot 3d visual grounding

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm-grounder: A vlm agent for zero-shot 3d visual grounding. InCoRL, 2024

2024
[53]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[54]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page...

2024
[55]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2022. 13

2022
[56]

How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, and Xinlei Chen. How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

work page arXiv 2025
[57]

Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization

Cheng Zhang, Zhaopeng Cui, Cai Chen, Shuaicheng Liu, Bing Zeng, Hujun Bao, and Yinda Zhang. Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021
[58]

Holistic 3d scene understanding from a single image with implicit representation

Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng, Marc Pollefeys, and Shuaicheng Liu. Holistic 3d scene understanding from a single image with implicit representation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[59]

Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792, 2025

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792, 2025

work page arXiv 2025
[60]

The scene language: Representing scenes with programs, words, and embeddings.arXiv preprint arXiv:2410.16770, 2025

Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, and Jiajun Wu. The scene language: Representing scenes with programs, words, and embeddings.arXiv preprint arXiv:2410.16770, 2025

work page arXiv 2025
[61]

GENA3D: Generative Amodal 3D Modeling by Bridging 2D Priors and 3D Coherence

Junwei Zhou and Yu-Wing Tai. Gena3d: Generative amodal 3d modeling by bridging 2d priors and 3d coherence.arXiv preprint arXiv:2511.21945, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[62]

Layout-your-3d: Controllable and precise 3d generation with 2d blueprint

Junwei Zhou, Xueting Li, Lu Qi, and Ming-Hsuan Yang. Layout-your-3d: Controllable and precise 3d generation with 2d blueprint. InInternational Conference on Learning Representations (ICLR), 2025

2025
[63]

Coco4d: Comprehensive and complex 4d scene generation.arXiv preprint arXiv:2506.19798, 2025

Junwei Zhou, Xueting Li, Lu Qi, and Ming-Hsuan Yang. Coco4d: Comprehensive and complex 4d scene generation.arXiv preprint arXiv:2506.19798, 2025

work page arXiv 2025
[64]

Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting.arXiv preprint arXiv:2402.07207, 2024

Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting.arXiv preprint arXiv:2402.07207, 2024

work page arXiv 2024
[65]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

work page arXiv 2024
[66]

Weaksam: Segment any- thing meets weakly-supervised instance-level recognition

Lianghui Zhu, Junwei Zhou, Yan Liu, Xin Hao, Wenyu Liu, and Xinggang Wang. Weaksam: Segment any- thing meets weakly-supervised instance-level recognition. InProceedings of the 32nd ACM International Conference on Multimedia, page 7947–7956, 2024

2024
[67]

Layoutnet: Reconstructing the 3d room layout from a single rgb image

Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 14 A More Implementation Details A.1 Model Architecture and Training Details Both the Perceiver and the LaP Planner are built on Qwen3-V...

2018
[68]

Rule-Based Refinement: The rule-based baseline applies a fixed sequence of geometric corrections in the discretized grid space, requiring no learned model or image input. It runs two alternating passes based on the contact scene-graph generated: a de-floating pass that snaps each floating object onto its corresponding supporter or lowers it to the ground,...
[69]

id": int,

One-shot VLM direct refinement: The direct prediction baseline prompts the finetuned Qwen3- VL-8B-Instruct model with the same perturbation and gt 3D layout data. Given the RGB image and a structured text description of the perturbed layout (object category, 2D bounding box, grid position, size, and yaw), the model is asked to output a corrected JSON layo...

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024

Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024

2024

[5] [5]

Physx-3d: Physical-grounded 3d asset generation

Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-3d: Physical-grounded 3d asset generation. arXiv preprint arXiv:2507.12465, 2025

work page arXiv 2025

[6] [6]

I-design: Personalized llm interior designer

Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. InEuropean Conference on Computer Vision (ECCV), pages 217–234. Springer, 2024

2024

[7] [7]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024

2024

[8] [8]

Control3d: Towards controllable text-to-3d generation

Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, and Tao Mei. Control3d: Towards controllable text-to-3d generation. InProceedings of the 31st ACM International Conference on Multimedia, pages 1148–1156, 2023

2023

[9] [9]

Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance

Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu. Comboverse: Compositional 3d assets creation using spatially-aware diffusion guidance. InEuropean Conference on Computer Vision (ECCV), pages 128–146. Springer, 2024

2024

[10] [10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InIEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 24185–24198, 2024

2024

[11] [11]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[12] [12]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems (NeurIPS), 36:18225–18250, 2023

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models.Advances in Neural Information Processing Systems (NeurIPS), 36:18225–18250, 2023

2023

[14] [14]

Compgs: Unleashing 2d compositionality for compositional text-to-3d via dynamically optimizing 3d gaussians

Chongjian Ge, Chenfeng Xu, Yuanfeng Ji, Chensheng Peng, Masayoshi Tomizuka, Ping Luo, Mingyu Ding, Varun Jampani, and Wei Zhan. Compgs: Unleashing 2d compositionality for compositional text-to-3d via dynamically optimizing 3d gaussians. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, pages 18509–18520, 2025

2025

[15] [15]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Gemma 3 Technical Report

Gemma Team, Google DeepMind. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Spatial reasoning with vision-language models in ego-centric multi-view scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. InInternational Conference on Learning Representations (ICLR), 2026

2026

[18] [18]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team, Zhipu AI & Tsinghua University. GLM-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Artiscene: Language-driven artistic 3d scene generation through image intermediary

Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, and Yifan Ding. Artiscene: Language-driven artistic 3d scene generation through image intermediary. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference (CVPR), pages 2891–2901, 2025

2025

[20] [20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. InAdvances in Neural Information Processing Systems (NeurIPS), pages 20482–20494, 2023

2023

[22] [22]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022

2022

[23] [23]

Ross, Cordelia Schmid, and Alireza Fathi

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An llm agent for synthesizing 3d scene as blender code.arXiv preprint arXiv:2403.01248, 2024

work page arXiv 2024

[24] [24]

Holistic 3d scene parsing and reconstruction from a single rgb image

Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic 3d scene parsing and reconstruction from a single rgb image. InEuropean Conference on Computer Vision (ECCV), 2018

2018

[25] [25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Junwei Liang. Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, pages 3707–3717, 2025

2025

[29] [29]

Advances in 3d generation: A survey.arXiv preprint arXiv:2401.17807, 2024

Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, and Ying Shan. Advances in 3d generation: A survey.arXiv preprint arXiv:2401.17807, 2024

work page arXiv 2024

[30] [30]

arXiv preprint arXiv:2505.02836 (2025)

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, and Zhaoshuo Li. Scenethesis: A language and vision agentic framework for 3d scene generation.arXiv preprint arXiv:2505.02836, 2025

work page arXiv 2025

[31] [31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[32] [32]

Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

Xinhang Liu, Chi-Keung Tang, and Yu-Wing Tai. Worldcraft: Photo-realistic 3d world creation and customization via llm agents.arXiv preprint arXiv:2502.15601, 2025

work page arXiv 2025

[33] [33]

The Llama 3 Herd of Models

Llama Team, AI @ Meta. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Scenegen: Single-image 3d scene generation in one feedforward pass

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass. InInternational Conference on 3D Vision 2026, 2026

2026

[35] [35]

To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image

Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jian Jun Zhang. To- tal3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020

[36] [36]

Courville

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI, 2018

2018

[37] [37]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in neural information processing systems (NeurIPS), pages 53728–53741, 2023. 12

2023

[38] [38]

Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning.arXiv preprint arXiv:2506.05341, 2025

Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, and Bo Dai. Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning.arXiv preprint arXiv:2506.05341, 2025

work page arXiv 2025

[39] [39]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10912–10922, 2021

2021

[40] [40]

SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Meta Superintelligence Labs. SAM 3D: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Sun rgb-d: A rgb-d scene understanding benchmark suite

Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, 2015

2015

[42] [42]

Layoutvlm: Differentiable optimization of 3d layout via vision-language models.arXiv preprint arXiv:2412.02193, 2024

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models.arXiv preprint arXiv:2412.02193, 2024

work page arXiv 2024

[43] [43]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction (second edition). MIT Press, 2018

2018

[44] [44]

A survey on post-training of large language models.arXiv preprint arXiv:2503.06072, 2025

Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, et al. A survey on post-training of large language models.arXiv preprint arXiv:2503.06072, 2025

work page arXiv 2025

[45] [45]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[46] [46]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Conference (CVPR), pages 5294–5306, 2025

2025

[47] [47]

Holistic 3d scene understanding from a single geo- tagged image

Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Holistic 3d scene understanding from a single geo- tagged image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015

2015

[48] [48]

N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models

Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, and Dong Yu. N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561, 2025

work page arXiv 2025

[49] [49]

Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image.arXiv preprint arXiv:2512.01204, 2025

Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, and Hu Su. Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image.arXiv preprint arXiv:2512.01204, 2025

work page arXiv 2025

[50] [50]

Amodal3r: Amodal 3d reconstruction from occluded 2d images

Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 9181–9193, 2025

2025

[51] [51]

Llava-cot: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 2087–2098, 2025

2087

[52] [52]

Vlm-grounder: A vlm agent for zero-shot 3d visual grounding

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm-grounder: A vlm agent for zero-shot 3d visual grounding. InCoRL, 2024

2024

[53] [53]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[54] [54]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page...

2024

[55] [55]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2022. 13

2022

[56] [56]

How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, and Xinlei Chen. How to enable llm with 3d capacity? a survey of spatial reasoning in llm.arXiv preprint arXiv:2504.05786, 2025

work page arXiv 2025

[57] [57]

Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization

Cheng Zhang, Zhaopeng Cui, Cai Chen, Shuaicheng Liu, Bing Zeng, Hujun Bao, and Yinda Zhang. Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021

[58] [58]

Holistic 3d scene understanding from a single image with implicit representation

Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng, Marc Pollefeys, and Shuaicheng Liu. Holistic 3d scene understanding from a single image with implicit representation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021

[59] [59]

Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792, 2025

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792, 2025

work page arXiv 2025

[60] [60]

The scene language: Representing scenes with programs, words, and embeddings.arXiv preprint arXiv:2410.16770, 2025

Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, and Jiajun Wu. The scene language: Representing scenes with programs, words, and embeddings.arXiv preprint arXiv:2410.16770, 2025

work page arXiv 2025

[61] [61]

GENA3D: Generative Amodal 3D Modeling by Bridging 2D Priors and 3D Coherence

Junwei Zhou and Yu-Wing Tai. Gena3d: Generative amodal 3d modeling by bridging 2d priors and 3d coherence.arXiv preprint arXiv:2511.21945, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[62] [62]

Layout-your-3d: Controllable and precise 3d generation with 2d blueprint

Junwei Zhou, Xueting Li, Lu Qi, and Ming-Hsuan Yang. Layout-your-3d: Controllable and precise 3d generation with 2d blueprint. InInternational Conference on Learning Representations (ICLR), 2025

2025

[63] [63]

Coco4d: Comprehensive and complex 4d scene generation.arXiv preprint arXiv:2506.19798, 2025

Junwei Zhou, Xueting Li, Lu Qi, and Ming-Hsuan Yang. Coco4d: Comprehensive and complex 4d scene generation.arXiv preprint arXiv:2506.19798, 2025

work page arXiv 2025

[64] [64]

Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting.arXiv preprint arXiv:2402.07207, 2024

Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Gala3d: Towards text-to-3d complex scene generation via layout-guided generative gaussian splatting.arXiv preprint arXiv:2402.07207, 2024

work page arXiv 2024

[65] [65]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

work page arXiv 2024

[66] [66]

Weaksam: Segment any- thing meets weakly-supervised instance-level recognition

Lianghui Zhu, Junwei Zhou, Yan Liu, Xin Hao, Wenyu Liu, and Xinggang Wang. Weaksam: Segment any- thing meets weakly-supervised instance-level recognition. InProceedings of the 32nd ACM International Conference on Multimedia, page 7947–7956, 2024

2024

[67] [67]

Layoutnet: Reconstructing the 3d room layout from a single rgb image

Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 14 A More Implementation Details A.1 Model Architecture and Training Details Both the Perceiver and the LaP Planner are built on Qwen3-V...

2018

[68] [68]

Rule-Based Refinement: The rule-based baseline applies a fixed sequence of geometric corrections in the discretized grid space, requiring no learned model or image input. It runs two alternating passes based on the contact scene-graph generated: a de-floating pass that snaps each floating object onto its corresponding supporter or lowers it to the ground,...

[69] [69]

id": int,

One-shot VLM direct refinement: The direct prediction baseline prompts the finetuned Qwen3- VL-8B-Instruct model with the same perturbation and gt 3D layout data. Given the RGB image and a structured text description of the perturbed layout (object category, 2D bounding box, grid position, size, and yaw), the model is asked to output a corrected JSON layo...