ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures

Daoguo Dong; Haoran Xu; Lechao Zhang; Xin Tan; Yan Gao

arxiv: 2606.28060 · v1 · pith:DVA23OOXnew · submitted 2026-06-26 · 💻 cs.CV

ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures

Haoran Xu , Lechao Zhang , Daoguo Dong , Yan Gao , Xin Tan This is my paper

Pith reviewed 2026-06-29 04:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords indoor scene reconstructionmulti-view capturesscene graphvision-language modelphysically consistent assemblyScanNetembodied AIhierarchical view selection

0 comments

The pith

ReScene threads multi-view geometry through view selection and relation fusion to assemble physically consistent indoor scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the main obstacle in building simulation-ready 3D scenes from multi-view captures is not single-object modeling but fusing relations across views into physically plausible layouts. ReScene treats multi-view geometry as a unifying prior that runs through the entire pipeline. HierView replaces the largest-mask heuristic with selection based on semantic consistency and 3D coverage completeness. Relation-Aware Assembly merges vision-language model relation predictions with geometric and room-shell priors inside a confidence-weighted scene graph. The resulting scenes improve geometry accuracy, rendering quality, and speed on ScanNet data and support creation of an embodied visual question answering dataset.

Core claim

ReScene is a framework for structured indoor scene reconstruction from multi-view captures whose central claim is that cross-view relation fusion and physically plausible scene assembly, not single-object reconstruction, form the core bottleneck. The method consists of HierView, which prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, and Relation-Aware Assembly, which fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph. On a set of ScanNet scenes this produces a 17 percent reduction in Chamfer Distance and 26 percent reduction in LPIPS relative to the stronge

What carries the argument

The confidence-weighted scene graph that fuses vision-language model relation predictions with geometric and room-shell priors to enable physically consistent scene assembly.

If this is right

Reconstructed scenes become directly usable for simulation in Embodied AI without extra physical cleanup steps.
The same pipeline can generate large embodied visual question answering datasets for training spatial-reasoning models.
Faster runtime enables reconstruction of substantially more scenes than previous multi-view approaches.
Explicit object-level structure and inter-object relations improve accuracy on downstream rendering and perceptual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The room-shell prior could be swapped for other structural knowledge to extend the method beyond indoor environments.
Stronger geometric priors might allow the framework to tolerate noisier vision-language predictions while preserving consistency.
The approach reduces dependence on specialized capture hardware by leveraging ordinary multi-view image sets.

Load-bearing premise

Multi-frame relation predictions from a vision-language model, when fused with geometric and room-shell priors via a confidence-weighted scene graph, will produce physically consistent assemblies without post-hoc corrections or additional physical simulation validation.

What would settle it

A collection of output scenes that contain clear physical violations such as intersecting objects or unsupported floating objects even when the scene graph reports high confidence on the assembled relations, or an ablation in which removing the relation-fusion step leaves physical-consistency metrics unchanged.

Figures

Figures reproduced from arXiv: 2606.28060 by Daoguo Dong, Haoran Xu, Lechao Zhang, Xin Tan, Yan Gao.

**Figure 2.** Figure 2: Embodied Visual Question Answering in Assembled Scenes. Given our structured scene [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison with baselines on representative ScanNet scenes. Our method [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of view selection and mesh generation. Columns 2–4 show our [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Constructing simulation-ready 3D scenes from multi-view captures is a key bottleneck for Embodied Artificial Intelligence, as downstream tasks require object-level structure, explicit inter-object relations, and physical plausibility. Existing approaches either rely on specialized capture hardware, suffer from single-view bias in object reconstruction, or yield layouts that are geometrically reasonable but physically inconsistent. We identify that the problem is not single-object reconstruction but cross-view relation fusion and physically plausible scene assembly. To address this challenge, we present ReScene, a framework that threads multi-view geometry throughout the pipeline as a unifying prior. Our method consists of two main components: HierView prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, replacing the largest-mask heuristic that conflates image occupancy with object coverage; and Relation-Aware Assembly fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph, enabling physically consistent scene assembly. ReScene sets a new state of the art across geometry, rendering, and perceptual quality on a set of ScanNet scenes, achieving a 17% reduction in Chamfer Distance and 26% in LPIPS over the strongest prior baseline, while running up to 10x faster than prior multi-view methods. Based on the reconstructed scenes, we also generate an embodied visual question answering dataset, on which fine-tuned Qwen-VL approaches the performance of strong closed-source models on several spatial reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReScene adds HierView for view selection and Relation-Aware Assembly for VLM-prior fusion in indoor scenes, with reported gains on ScanNet, but physical consistency rests on unvalidated assumptions.

read the letter

Colleague,

ReScene focuses on structured indoor reconstruction from multi-view data by threading geometry through the whole pipeline. The two named pieces are HierView, which picks views using semantic consistency and coverage instead of largest-mask heuristics, and Relation-Aware Assembly, which builds a confidence-weighted scene graph from multi-frame VLM relation predictions plus geometric and room-shell priors.

The paper does a clear job naming the bottleneck—cross-view relation fusion rather than single-object reconstruction—and reports concrete numbers on ScanNet: 17% lower Chamfer distance, 26% better LPIPS, and up to 10x faster runtime than prior multi-view methods. Generating an embodied VQA dataset from the output scenes is a straightforward downstream step that shows they considered use in robotics.

The soft spot is the physical-consistency claim. The assembly step is supposed to produce simulation-ready scenes without extra fixes, yet the abstract gives no penetration rates, stability checks under gravity, or collision metrics. VLM outputs can still hallucinate relations, and nothing in the provided description shows those errors are caught by the priors alone. The quantitative claims also lack any mention of error bars, exact baseline implementations, or dataset splits, so the size of the gains is hard to judge without the full text.

This is aimed at people building simulation environments or embodied agents who need object-level structure and relations. A reader working on multi-view fusion or scene graphs could extract the component ideas even if the end-to-end numbers need scrutiny. The work engages the right literature on the embodied-AI data gap and makes testable claims, so it deserves a serious referee rather than a desk reject.

I would bring it to a reading group to walk through the assembly details. I would not cite it in the next year without seeing the validation numbers. Send it to review.

Referee Report

2 major / 1 minor

Summary. The paper introduces ReScene, a framework for structured indoor scene reconstruction from multi-view captures aimed at producing simulation-ready scenes for Embodied AI. It proposes two components: HierView, which prioritizes views based on semantic consistency and 3D coverage instead of largest-mask heuristics, and Relation-Aware Assembly, which fuses multi-frame VLM relation predictions with geometric and room-shell priors into a confidence-weighted scene graph for physically consistent object assemblies. The work claims SOTA results on ScanNet scenes (17% Chamfer Distance reduction, 26% LPIPS improvement, up to 10x faster than prior multi-view methods) and generates an embodied VQA dataset where fine-tuned Qwen-VL approaches closed-source model performance on spatial tasks.

Significance. If the quantitative gains and physical consistency claims are substantiated, the work would meaningfully advance multi-view scene reconstruction by threading geometry as a prior throughout and leveraging VLM for explicit relations, addressing a bottleneck for downstream Embodied AI tasks. The view prioritization and scene-graph assembly ideas have potential for broader adoption if validated.

major comments (2)

[Abstract] Abstract: The headline quantitative claims (17% Chamfer Distance reduction, 26% LPIPS improvement, 10x speed-up) are presented without any evaluation protocol, baseline details, dataset splits, error bars, or statistical tests. This directly undermines assessment of the SOTA assertion, which is load-bearing for the paper's contribution.
[Abstract] Abstract (Relation-Aware Assembly description): The central claim that the method produces physically consistent assemblies via VLM-prior fusion lacks any quantitative validation (e.g., penetration rates, gravity stability, or collision metrics). This is load-bearing because the paper positions physical plausibility as the key differentiator from prior geometrically reasonable but inconsistent layouts.

minor comments (1)

[Abstract] Abstract: The embodied VQA dataset is mentioned as a contribution but without any details on construction, size, task definitions, or evaluation splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the headline claims require better contextualization and will revise accordingly. Point-by-point responses to the major comments follow.

read point-by-point responses

Referee: [Abstract] Abstract: The headline quantitative claims (17% Chamfer Distance reduction, 26% LPIPS improvement, 10x speed-up) are presented without any evaluation protocol, baseline details, dataset splits, error bars, or statistical tests. This directly undermines assessment of the SOTA assertion, which is load-bearing for the paper's contribution.

Authors: The full manuscript details the evaluation protocol, baselines, ScanNet splits, and metrics (including error bars from repeated runs and statistical tests) in Section 4 and the supplementary material. The abstract is length-constrained, but we will revise it to briefly reference the evaluation setting (e.g., "on ScanNet validation scenes") to improve self-containment while preserving conciseness. revision: partial
Referee: [Abstract] Abstract (Relation-Aware Assembly description): The central claim that the method produces physically consistent assemblies via VLM-prior fusion lacks any quantitative validation (e.g., penetration rates, gravity stability, or collision metrics). This is load-bearing because the paper positions physical plausibility as the key differentiator from prior geometrically reasonable but inconsistent layouts.

Authors: We agree that the current manuscript lacks direct quantitative physical-consistency metrics such as penetration or collision rates. Physical plausibility is achieved through the geometric and room-shell priors in Relation-Aware Assembly and is supported indirectly by the improved geometry/perceptual metrics and embodied VQA results. We will add explicit quantitative physical validation (e.g., simulation-based collision and stability metrics) to the experiments section in revision. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is methodological with no derivational reductions

full rationale

The paper describes an engineering pipeline (HierView view prioritization + Relation-Aware Assembly via VLM + geometric priors) evaluated empirically on ScanNet. No equations, parameter-fitting steps, or first-principles derivations appear in the abstract or described structure. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The physical-consistency claim is an empirical assertion (not a derived theorem), and the reader's note confirms absence of equations. This is the common honest case of a self-contained applied method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no parameter counts, and no explicit assumptions beyond the high-level claim that multi-view geometry acts as a unifying prior; therefore the ledger is empty.

pith-pipeline@v0.9.1-grok · 5799 in / 1181 out tokens · 33106 ms · 2026-06-29T04:27:39.365309+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

82 extracted references · 29 canonical work pages · 9 internal anchors

[1]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022. URL https: //arxiv.org/abs/1712.05474

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InInternational Conference on Computer Vision (ICCV), pages 9339–9347, 2019

2019
[3]

Procthor: Large- scale embodied ai using procedural generation, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large- scale embodied ai using procedural generation, 2022. URL https://arxiv.org/abs/2206. 06994

2022
[4]

3D-FRONT: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3d furnished rooms with layouts and semantics. InInternational Conference on Computer Vision (ICCV), pages 10933–10942, 2021

2021
[5]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023

2023
[6]

Metascenes: Towards automated replica creation for real-world 3d scans, 2025

Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, and Siyuan Huang. Metascenes: Towards automated replica creation for real-world 3d scans, 2025. URL https://arxiv.org/abs/ 2505.02388

work page arXiv 2025
[7]

LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025. URLhttps://arxiv.org/abs/2507.02861

work page arXiv 2025
[8]

Gpt4scene: Understand 3d scenes from videos with vision-language models,

Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3d scenes from videos with vision-language models, 2025. URL https://arxiv. org/abs/2501.01428

work page arXiv 2025
[9]

SimRecon: Simready compositional scene reconstruction from real videos, 2026

Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, and Yueqi Duan. SimRecon: Simready compositional scene reconstruction from real videos, 2026. URL https://arxiv. org/abs/2603.02133

work page arXiv 2026
[10]

RICO: Regularizing the unobservable for indoor compositional reconstruction

Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng Wang, Yiyi Liao, and Yong Liu. RICO: Regularizing the unobservable for indoor compositional reconstruction. InInternational Conference on Computer Vision (ICCV), 2023

2023
[11]

Decompositional neural scene reconstruction with generative diffusion prior, 2025

Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, and Siyuan Huang. Decompositional neural scene reconstruction with generative diffusion prior, 2025. URL https://arxiv.org/abs/2503.14830

work page arXiv 2025
[12]

Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025

Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, and Hujun Bao. Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025. URLhttps://arxiv.org/abs/2507.08416

work page arXiv 2025
[13]

Embodied Question Answering

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering, 2017. URLhttps://arxiv.org/abs/1711.11543

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

ScanQA: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3d question answering for spatial scene understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 10

2022
[15]

SQA3D: Situated question answering in 3d scenes

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3d scenes. InInternational Conference on Learning Representations (ICLR), 2023

2023
[16]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

2023
[17]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), pages 405–421, 2020

2020
[18]

MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction

Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022
[19]

Generalizable 3d scene reconstruction via divide and conquer from a single view

Andreea Ardelean, Mert Özer, and Bernhard Egger. Generalizable 3d scene reconstruction via divide and conquer from a single view. InInternational Conference on 3D Vision (3DV), 2025

2025
[20]

Midi: Multi-instance diffusion for single image to 3d scene generation, 2025

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation, 2025. URLhttps://arxiv.org/abs/2412.03558

work page arXiv 2025
[21]

Scenegen: Single-image 3d scene generation in one feedforward pass, 2025

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass, 2025. URLhttps://arxiv.org/abs/2508.15769

work page arXiv 2025
[22]

Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans. Graph., 44(4), July 2025. ISSN 0730-0301. doi: 10.1145/3730841. URL https://doi.org/10.1145/3730841

work page doi:10.1145/3730841 2025
[23]

DeepSDF: Learning continuous signed distance functions for shape representation

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2019
[24]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. InInternational Conference on Learning Representations (ICLR), 2023

2023
[25]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[26]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

2021
[27]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[28]

ShapeNet: An Information-Rich 3D Model Repository

Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[29]

Objaverse-XL: A universe of 10m+ 3d objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Ujval Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A universe of 10m+ 3d objects. InAdvances in Neural Information Processin...

2023
[30]

3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

2021
[31]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), 2023. 11

2023
[32]

Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024. URLhttps://arxiv.org/abs/2406.13897

work page arXiv 2024
[33]

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, and Yan-Pei Cao. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models, 2025. URL https://arxiv.org/ abs/2502.06608

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024. URL https://arxiv.org/abs/2405.14832

work page arXiv 2024
[35]

Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2025

Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, Lifu Wang, Jing Xu, Zebin He, Zhuo Chen, Sicong Liu, Junta Wu, Yihang Lian, Shaoxiong Yang, Yuhong Liu, Yong Yang, Di Wang, Jie Jiang, and Chunchao Guo. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generatio...

work page arXiv 2025
[36]

Structured 3d latents for scalable and versatile 3d generation,

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation,
[37]

URLhttps://arxiv.org/abs/2412.01506

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025

Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025. URLhttps://arxiv.org/abs/2405.14979

work page arXiv 2025
[39]

Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023

Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023. URLhttps://arxiv.org/abs/2306.17115

work page arXiv 2023
[40]

Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025

Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025. URLhttps://arxiv.org/abs/2503.22236

work page arXiv 2025
[41]

SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. SAM 3D: 3dfy anything in images, 2025. URLht...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

2017
[43]

Matterport3D: Learning from RGB-D data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. InInternational Conference on 3D Vision (3DV), 2017

2017
[44]

ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021
[45]

ScanNet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3d indoor scenes. InInternational Conference on Computer Vision (ICCV), 2023

2023
[46]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yuan Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke Strasdat...

work page internal anchor Pith review Pith/arXiv arXiv 1906
[47]

SceneNN: A scene meshes dataset with annotations

Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. InInternational Conference on 3D Vision (3DV), 2016

2016
[48]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv. org/abs/...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,

Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action, 2022. URL https://arxiv.org/ abs/2207.04429

work page arXiv 2022
[50]

Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei

Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments, 2021. URL https://arxiv.org/abs/ 2108.03332

work page arXiv 2021
[51]

arXiv preprint arXiv:2310.13724 (2023) 3

Xavier Puig, Eric Undersander, Andrew Szot, Marc-Alexandre Cote, Tsung-Yen Yang, Rus- lan Partsey, Rutav Desai, Alexander Clegg, Michal Hlavac, So Yeon Min, Viktor V ondrus, Theophile Gervet, Vincent-Pierre Berges, John Turner, Oleksandr Maksymets, Zsolt Kira, Mri- nal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara ...

work page arXiv 2023
[52]

Behavior vision suite: Customizable dataset generation via simulation, 2024

Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, and Jiajun Wu. Behavior vision suite: Custo...

2024
[53]

ATISS: Autoregressive transformers for indoor scene synthesis

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregressive transformers for indoor scene synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021
[54]

Holodeck: Language guided generation of 3d embodied ai environments, 2024

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024. URLhttps://arxiv.org/abs/2312.09067

work page arXiv 2024
[55]

DiffuScene: Denoising diffusion models for generative indoor scene synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. DiffuScene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[56]

Panst3r: Multi-view consistent panoptic segmentation, 2025

Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, and Gabriela Csurka. Panst3r: Multi-view consistent panoptic segmentation, 2025. URL https://arxiv.org/abs/2506.21348

work page arXiv 2025
[57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInterna- tional Conference on Machine Learning (ICML), 2021

2021
[58]

Fast point feature histograms (FPFH) for 3d registration

Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3d registration. InIEEE International Conference on Robotics and Automation (ICRA), 2009

2009
[59]

Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991. 13

1991
[60]

Learning 3d semantic scene graphs from 3d indoor reconstructions

Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020
[61]

Hydra: A real-time spatial perception system for 3d scene graph construction and optimization

Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. InRobotics: Science and Systems (RSS), 2022

2022
[62]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

visual_fidelity

Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, and Wei-Chiu Ma. DRAWER: Digital recon- struction and articulation with environment realism. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21771–21782, 2025. A Evaluation Protocol A.1 Baseline In...

2025
[64]

Panoptic multi-view reconstruction.We run PanSt3R to obtain camera-aware multi-view geometry, instance masks, per-instance labels, and semantic point clouds in a shared world coordinate frame
[65]

HierView selection.For each instance, we select one reconstruction view using the visibility, semantic, and 3D-completeness filters described in Sec. 3.2
[66]

Single-view asset generation.The selected image and mask are passed to SAM3D-Objects to reconstruct an initial per-instance mesh
[67]

Instance registration.Each generated mesh is aligned to its corresponding semantic point cloud using bounded Sim(3) registration
[68]

Scene graph construction.We select key frames by greedy 3D coverage, render ID-overlaid images, infer per-frame relations with a VLM, and merge them into a global scene graph
[69]

Attachment and refinement.We compile relation edges into geometric constraints, run the attachment solver, optionally run a floor-only branch for dual attachment, and apply staged post-refinement and depenetration. All stages share the same instance IDs, so the selected view, generated mesh, registration target, scene graph node, and final attachment tran...
[70]

Build the candidate view set from frames where the instance has a non-empty mask
[71]

Remove candidates with mask area belowa min or bounding-box area ratio belowb min. 17
[72]

Encode the masked crop and the category text with CLIP, then keep views whose semantic score is above the threshold or whose semantic rank is sufficiently high
[73]

Project a capped set of instance points into each remaining view and compute the fraction landing inside the mask
[74]

objects": [ {

Select the view with the highest 3D-to-2D completeness score. If a stage removes all candidates, the implementation falls back to the best candidate from the previous stage rather than dropping the instance immediately. Table A2: Default HierView hyperparameters. Parameter Value CLIP backend/model ViT-B/32 Semantic probability thresholdτ s 0.12 Semantic r...
[75]

Floor snap: snap floor-supported objects to the floor plane and project footprints back into the floor boundary if needed
[76]

Wall resolution: resolve the canonical wall root to a concrete wall plane by distance and orientation agreement
[77]

Wall attachment: align wall-mounted objects to the selected wall plane while preserving lateral order along the wall
[78]

Object support: project child contact surfaces onto feasible support faces of parent objects
[79]

Near-wall snap: optionally snap large furniture near walls when the scene graph and geometry support it
[80]

Outside repair: move objects whose footprints leave the room boundary back toward feasible interior positions

Showing first 80 references.

[1] [1]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022. URL https: //arxiv.org/abs/1712.05474

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InInternational Conference on Computer Vision (ICCV), pages 9339–9347, 2019

2019

[3] [3]

Procthor: Large- scale embodied ai using procedural generation, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large- scale embodied ai using procedural generation, 2022. URL https://arxiv.org/abs/2206. 06994

2022

[4] [4]

3D-FRONT: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3d furnished rooms with layouts and semantics. InInternational Conference on Computer Vision (ICCV), pages 10933–10942, 2021

2021

[5] [5]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023

2023

[6] [6]

Metascenes: Towards automated replica creation for real-world 3d scans, 2025

Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, and Siyuan Huang. Metascenes: Towards automated replica creation for real-world 3d scans, 2025. URL https://arxiv.org/abs/ 2505.02388

work page arXiv 2025

[7] [7]

LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025. URLhttps://arxiv.org/abs/2507.02861

work page arXiv 2025

[8] [8]

Gpt4scene: Understand 3d scenes from videos with vision-language models,

Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3d scenes from videos with vision-language models, 2025. URL https://arxiv. org/abs/2501.01428

work page arXiv 2025

[9] [9]

SimRecon: Simready compositional scene reconstruction from real videos, 2026

Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, and Yueqi Duan. SimRecon: Simready compositional scene reconstruction from real videos, 2026. URL https://arxiv. org/abs/2603.02133

work page arXiv 2026

[10] [10]

RICO: Regularizing the unobservable for indoor compositional reconstruction

Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng Wang, Yiyi Liao, and Yong Liu. RICO: Regularizing the unobservable for indoor compositional reconstruction. InInternational Conference on Computer Vision (ICCV), 2023

2023

[11] [11]

Decompositional neural scene reconstruction with generative diffusion prior, 2025

Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, and Siyuan Huang. Decompositional neural scene reconstruction with generative diffusion prior, 2025. URL https://arxiv.org/abs/2503.14830

work page arXiv 2025

[12] [12]

Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025

Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, and Hujun Bao. Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025. URLhttps://arxiv.org/abs/2507.08416

work page arXiv 2025

[13] [13]

Embodied Question Answering

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering, 2017. URLhttps://arxiv.org/abs/1711.11543

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

ScanQA: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3d question answering for spatial scene understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 10

2022

[15] [15]

SQA3D: Situated question answering in 3d scenes

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3d scenes. InInternational Conference on Learning Representations (ICLR), 2023

2023

[16] [16]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

2023

[17] [17]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), pages 405–421, 2020

2020

[18] [18]

MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction

Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022

[19] [19]

Generalizable 3d scene reconstruction via divide and conquer from a single view

Andreea Ardelean, Mert Özer, and Bernhard Egger. Generalizable 3d scene reconstruction via divide and conquer from a single view. InInternational Conference on 3D Vision (3DV), 2025

2025

[20] [20]

Midi: Multi-instance diffusion for single image to 3d scene generation, 2025

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation, 2025. URLhttps://arxiv.org/abs/2412.03558

work page arXiv 2025

[21] [21]

Scenegen: Single-image 3d scene generation in one feedforward pass, 2025

Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass, 2025. URLhttps://arxiv.org/abs/2508.15769

work page arXiv 2025

[22] [22]

Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans. Graph., 44(4), July 2025. ISSN 0730-0301. doi: 10.1145/3730841. URL https://doi.org/10.1145/3730841

work page doi:10.1145/3730841 2025

[23] [23]

DeepSDF: Learning continuous signed distance functions for shape representation

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2019

[24] [24]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. InInternational Conference on Learning Representations (ICLR), 2023

2023

[25] [25]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020

[26] [26]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

2021

[27] [27]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022

[28] [28]

ShapeNet: An Information-Rich 3D Model Repository

Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[29] [29]

Objaverse-XL: A universe of 10m+ 3d objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Ujval Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A universe of 10m+ 3d objects. InAdvances in Neural Information Processin...

2023

[30] [30]

3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

2021

[31] [31]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), 2023. 11

2023

[32] [32]

Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024. URLhttps://arxiv.org/abs/2406.13897

work page arXiv 2024

[33] [33]

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, and Yan-Pei Cao. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models, 2025. URL https://arxiv.org/ abs/2502.06608

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024. URL https://arxiv.org/abs/2405.14832

work page arXiv 2024

[35] [35]

Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2025

Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, Lifu Wang, Jing Xu, Zebin He, Zhuo Chen, Sicong Liu, Junta Wu, Yihang Lian, Shaoxiong Yang, Yuhong Liu, Yong Yang, Di Wang, Jie Jiang, and Chunchao Guo. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generatio...

work page arXiv 2025

[36] [36]

Structured 3d latents for scalable and versatile 3d generation,

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation,

[37] [37]

URLhttps://arxiv.org/abs/2412.01506

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025

Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025. URLhttps://arxiv.org/abs/2405.14979

work page arXiv 2025

[39] [39]

Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023

Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023. URLhttps://arxiv.org/abs/2306.17115

work page arXiv 2023

[40] [40]

Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025

Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025. URLhttps://arxiv.org/abs/2503.22236

work page arXiv 2025

[41] [41]

SAM 3D: 3Dfy Anything in Images

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. SAM 3D: 3dfy anything in images, 2025. URLht...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

2017

[43] [43]

Matterport3D: Learning from RGB-D data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. InInternational Conference on 3D Vision (3DV), 2017

2017

[44] [44]

ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021

[45] [45]

ScanNet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3d indoor scenes. InInternational Conference on Computer Vision (ICCV), 2023

2023

[46] [46]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yuan Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke Strasdat...

work page internal anchor Pith review Pith/arXiv arXiv 1906

[47] [47]

SceneNN: A scene meshes dataset with annotations

Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. InInternational Conference on 3D Vision (3DV), 2016

2016

[48] [48]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv. org/abs/...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,

Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action, 2022. URL https://arxiv.org/ abs/2207.04429

work page arXiv 2022

[50] [50]

Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei

Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments, 2021. URL https://arxiv.org/abs/ 2108.03332

work page arXiv 2021

[51] [51]

arXiv preprint arXiv:2310.13724 (2023) 3

Xavier Puig, Eric Undersander, Andrew Szot, Marc-Alexandre Cote, Tsung-Yen Yang, Rus- lan Partsey, Rutav Desai, Alexander Clegg, Michal Hlavac, So Yeon Min, Viktor V ondrus, Theophile Gervet, Vincent-Pierre Berges, John Turner, Oleksandr Maksymets, Zsolt Kira, Mri- nal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara ...

work page arXiv 2023

[52] [52]

Behavior vision suite: Customizable dataset generation via simulation, 2024

Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, and Jiajun Wu. Behavior vision suite: Custo...

2024

[53] [53]

ATISS: Autoregressive transformers for indoor scene synthesis

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregressive transformers for indoor scene synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021

[54] [54]

Holodeck: Language guided generation of 3d embodied ai environments, 2024

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024. URLhttps://arxiv.org/abs/2312.09067

work page arXiv 2024

[55] [55]

DiffuScene: Denoising diffusion models for generative indoor scene synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. DiffuScene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[56] [56]

Panst3r: Multi-view consistent panoptic segmentation, 2025

Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, and Gabriela Csurka. Panst3r: Multi-view consistent panoptic segmentation, 2025. URL https://arxiv.org/abs/2506.21348

work page arXiv 2025

[57] [57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInterna- tional Conference on Machine Learning (ICML), 2021

2021

[58] [58]

Fast point feature histograms (FPFH) for 3d registration

Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3d registration. InIEEE International Conference on Robotics and Automation (ICRA), 2009

2009

[59] [59]

Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991. 13

1991

[60] [60]

Learning 3d semantic scene graphs from 3d indoor reconstructions

Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020

[61] [61]

Hydra: A real-time spatial perception system for 3d scene graph construction and optimization

Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. InRobotics: Science and Systems (RSS), 2022

2022

[62] [62]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

visual_fidelity

Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, and Wei-Chiu Ma. DRAWER: Digital recon- struction and articulation with environment realism. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21771–21782, 2025. A Evaluation Protocol A.1 Baseline In...

2025

[64] [64]

Panoptic multi-view reconstruction.We run PanSt3R to obtain camera-aware multi-view geometry, instance masks, per-instance labels, and semantic point clouds in a shared world coordinate frame

[65] [65]

HierView selection.For each instance, we select one reconstruction view using the visibility, semantic, and 3D-completeness filters described in Sec. 3.2

[66] [66]

Single-view asset generation.The selected image and mask are passed to SAM3D-Objects to reconstruct an initial per-instance mesh

[67] [67]

Instance registration.Each generated mesh is aligned to its corresponding semantic point cloud using bounded Sim(3) registration

[68] [68]

Scene graph construction.We select key frames by greedy 3D coverage, render ID-overlaid images, infer per-frame relations with a VLM, and merge them into a global scene graph

[69] [69]

Attachment and refinement.We compile relation edges into geometric constraints, run the attachment solver, optionally run a floor-only branch for dual attachment, and apply staged post-refinement and depenetration. All stages share the same instance IDs, so the selected view, generated mesh, registration target, scene graph node, and final attachment tran...

[70] [70]

Build the candidate view set from frames where the instance has a non-empty mask

[71] [71]

Remove candidates with mask area belowa min or bounding-box area ratio belowb min. 17

[72] [72]

Encode the masked crop and the category text with CLIP, then keep views whose semantic score is above the threshold or whose semantic rank is sufficiently high

[73] [73]

Project a capped set of instance points into each remaining view and compute the fraction landing inside the mask

[74] [74]

objects": [ {

Select the view with the highest 3D-to-2D completeness score. If a stage removes all candidates, the implementation falls back to the best candidate from the previous stage rather than dropping the instance immediately. Table A2: Default HierView hyperparameters. Parameter Value CLIP backend/model ViT-B/32 Semantic probability thresholdτ s 0.12 Semantic r...

[75] [75]

Floor snap: snap floor-supported objects to the floor plane and project footprints back into the floor boundary if needed

[76] [76]

Wall resolution: resolve the canonical wall root to a concrete wall plane by distance and orientation agreement

[77] [77]

Wall attachment: align wall-mounted objects to the selected wall plane while preserving lateral order along the wall

[78] [78]

Object support: project child contact surfaces onto feasible support faces of parent objects

[79] [79]

Near-wall snap: optionally snap large furniture near walls when the scene graph and geometry support it

[80] [80]

Outside repair: move objects whose footprints leave the room boundary back toward feasible interior positions