arxiv: 2606.28060 · v1 · pith:DVA23OOXnew · submitted 2026-06-26 · 💻 cs.CV

ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures

Pith reviewed 2026-06-29 04:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords indoor scene reconstruction · multi-view captures · scene graph · vision-language model · physically consistent assembly · ScanNet · embodied AI · hierarchical view selection

The pith

ReScene threads multi-view geometry through view selection and relation fusion to assemble physically consistent indoor scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the main obstacle in building simulation-ready 3D scenes from multi-view captures is not single-object modeling but fusing relations across views into physically plausible layouts. ReScene treats multi-view geometry as a unifying prior that runs through the entire pipeline. HierView replaces the largest-mask heuristic with selection based on semantic consistency and 3D coverage completeness. Relation-Aware Assembly merges vision-language model relation predictions with geometric and room-shell priors inside a confidence-weighted scene graph. The resulting scenes improve geometry accuracy, rendering quality, and speed on ScanNet data and support creation of an embodied visual question answering dataset.

Core claim

ReScene is a framework for structured indoor scene reconstruction from multi-view captures whose central claim is that cross-view relation fusion and physically plausible scene assembly, not single-object reconstruction, form the core bottleneck. The method consists of HierView, which prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, and Relation-Aware Assembly, which fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph. On a set of ScanNet scenes this produces a 17 percent reduction in Chamfer Distance and 26 percent reduction in LPIPS relative to the stronge
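
The contrast between the largest-mask heuristic and a selection driven by semantic consistency and 3D coverage can be illustrated with a small sketch. This is not the authors' HierView implementation: the `ViewCandidate` fields, the `min_semantic` threshold, and the two-stage ordering are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ViewCandidate:
    frame_id: int
    mask_area: float           # fraction of image pixels covered by the instance mask
    semantic_agreement: float  # in [0, 1]: agreement with the cross-view label (assumed score)
    coverage: float            # in [0, 1]: share of the instance's 3D points seen (assumed score)


def largest_mask_view(cands):
    """Baseline heuristic: pick the view where the object fills the most pixels."""
    return max(cands, key=lambda c: c.mask_area)


def hierarchical_view(cands, min_semantic=0.5):
    """Hypothetical two-stage selection: filter on semantics, then rank by 3D coverage."""
    # Stage 1: keep semantically consistent views; fall back to all views if none pass.
    consistent = [c for c in cands if c.semantic_agreement >= min_semantic] or list(cands)
    # Stage 2: prefer the view that covers the most of the object in 3D.
    return max(consistent, key=lambda c: (c.coverage, c.semantic_agreement))


views = [
    ViewCandidate(0, mask_area=0.40, semantic_agreement=0.9, coverage=0.35),  # close-up, partial
    ViewCandidate(1, mask_area=0.15, semantic_agreement=0.8, coverage=0.80),  # farther, fuller
    ViewCandidate(2, mask_area=0.25, semantic_agreement=0.3, coverage=0.90),  # inconsistent label
]
print(largest_mask_view(views).frame_id)   # 0
print(hierarchical_view(views).frame_id)   # 1
```

The close-up frame wins on image occupancy but sees only part of the object, which is the conflation of image occupancy with object coverage that the paper attributes to the largest-mask heuristic.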

What carries the argument

The confidence-weighted scene graph that fuses vision-language model relation predictions with geometric and room-shell priors to enable physically consistent scene assembly.
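
One way to picture a confidence-weighted scene graph is as a vote over relation labels per object pair, where each frame's prediction and each prior contributes its confidence. The sketch below assumes that reading; the function name `fuse_relations`, the additive voting rule, and the `prior_weight` parameter are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def fuse_relations(frame_predictions, geometric_prior, prior_weight=1.0):
    """Fuse per-frame relation predictions with prior scores (illustrative rule).

    frame_predictions: iterable of (subject, object, relation, confidence).
    geometric_prior:   dict mapping (subject, object, relation) -> score in [0, 1].
    Returns a dict mapping (subject, object) -> (relation, normalized confidence).
    """
    votes = defaultdict(lambda: defaultdict(float))
    for subj, obj, rel, conf in frame_predictions:
        votes[(subj, obj)][rel] += conf
    for (subj, obj, rel), score in geometric_prior.items():
        votes[(subj, obj)][rel] += prior_weight * score

    graph = {}
    for pair, rel_scores in votes.items():
        total = sum(rel_scores.values())
        if total <= 0:
            continue  # no evidence for this pair
        best = max(rel_scores, key=rel_scores.get)
        graph[pair] = (best, rel_scores[best] / total)
    return graph


predictions = [
    ("lamp", "desk", "on", 0.75),       # frame 12
    ("lamp", "desk", "next_to", 0.5),   # frame 30, ambiguous viewpoint
    ("lamp", "desk", "on", 0.75),       # frame 41
]
prior = {("lamp", "desk", "on"): 0.5}   # e.g. the lamp's base sits at the desk's top height
graph = fuse_relations(predictions, prior)
print(graph[("lamp", "desk")])          # ('on', 0.8)
```

The normalized confidence gives the assembly stage a way to trust strongly supported edges more than edges that rest on a single ambiguous frame.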

If this is right

  • Reconstructed scenes become directly usable for simulation in Embodied AI without extra physical cleanup steps.
  • The same pipeline can generate large embodied visual question answering datasets for training spatial-reasoning models.
  • Faster runtime enables reconstruction of substantially more scenes than previous multi-view approaches.
  • Explicit object-level structure and inter-object relations improve accuracy on downstream rendering and perceptual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The room-shell prior could be swapped for other structural knowledge to extend the method beyond indoor environments.
  • Stronger geometric priors might allow the framework to tolerate noisier vision-language predictions while preserving consistency.
  • The approach reduces dependence on specialized capture hardware by leveraging ordinary multi-view image sets.

Load-bearing premise

Multi-frame relation predictions from a vision-language model, when fused with geometric and room-shell priors via a confidence-weighted scene graph, will produce physically consistent assemblies without post-hoc corrections or additional physical simulation validation.

What would settle it

A collection of output scenes that contain clear physical violations such as intersecting objects or unsupported floating objects even when the scene graph reports high confidence on the assembled relations, or an ablation in which removing the relation-fusion step leaves physical-consistency metrics unchanged.
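
A check of this kind could start from something as coarse as axis-aligned bounding boxes. The sketch below flags intersecting pairs and objects that rest on neither the floor nor another object's top face. It is a simplified proxy under stated assumptions (z-up boxes, a flat floor at `floor_z`, a contact tolerance `tol`), not a substitute for mesh-level collision tests or physics simulation.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    lo: tuple  # (x, y, z) minimum corner, z is up
    hi: tuple  # (x, y, z) maximum corner


def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned boxes (0.0 if disjoint or touching)."""
    vol = 1.0
    for i in range(3):
        extent = min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i])
        if extent <= 0:
            return 0.0
        vol *= extent
    return vol


def is_supported(box, others, floor_z=0.0, tol=0.02):
    """True if the box bottom touches the floor or the top face of another box."""
    if abs(box.lo[2] - floor_z) <= tol:
        return True
    for o in others:
        if o is box:
            continue
        touching = abs(box.lo[2] - o.hi[2]) <= tol
        xy_overlap = all(min(box.hi[i], o.hi[i]) > max(box.lo[i], o.lo[i]) for i in range(2))
        if touching and xy_overlap:
            return True
    return False


def find_violations(boxes, min_overlap=1e-6):
    issues = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if overlap_volume(a, b) > min_overlap:
                issues.append(("intersect", a.name, b.name))
        if not is_supported(a, boxes):
            issues.append(("floating", a.name, None))
    return issues


scene = [
    Box("desk", (0.0, 0.0, 0.0), (1.0, 0.5, 0.75)),
    Box("lamp", (0.25, 0.125, 0.75), (0.5, 0.375, 1.25)),  # rests on the desk
    Box("chair", (0.5, 0.25, 0.0), (1.5, 0.75, 0.5)),      # pushed into the desk
    Box("vase", (2.0, 2.0, 1.0), (2.25, 2.25, 1.25)),      # hangs in mid-air
]
print(find_violations(scene))
# [('intersect', 'desk', 'chair'), ('floating', 'vase', None)]
```

Counting such flags over scenes whose graph edges carry high confidence would give a rough rate for the failure case described above.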

Figures

Figures reproduced from arXiv: 2606.28060 by Daoguo Dong, Haoran Xu, Lechao Zhang, Xin Tan, Yan Gao.

Figure 1: Overview of our framework. Given a casually captured room video, our method proceeds ...
Figure 2: Embodied Visual Question Answering in Assembled Scenes. Given our structured scene ...
Figure 3: Qualitative comparison with baselines on representative ScanNet scenes. Our method ...
Figure 4: Qualitative comparison of view selection and mesh generation. Columns 2–4 show our ...
Original abstract

Constructing simulation-ready 3D scenes from multi-view captures is a key bottleneck for Embodied Artificial Intelligence, as downstream tasks require object-level structure, explicit inter-object relations, and physical plausibility. Existing approaches either rely on specialized capture hardware, suffer from single-view bias in object reconstruction, or yield layouts that are geometrically reasonable but physically inconsistent. We identify that the problem is not single-object reconstruction but cross-view relation fusion and physically plausible scene assembly. To address this challenge, we present ReScene, a framework that threads multi-view geometry throughout the pipeline as a unifying prior. Our method consists of two main components: HierView prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, replacing the largest-mask heuristic that conflates image occupancy with object coverage; and Relation-Aware Assembly fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph, enabling physically consistent scene assembly. ReScene sets a new state of the art across geometry, rendering, and perceptual quality on a set of ScanNet scenes, achieving a 17% reduction in Chamfer Distance and 26% in LPIPS over the strongest prior baseline, while running up to 10x faster than prior multi-view methods. Based on the reconstructed scenes, we also generate an embodied visual question answering dataset, on which fine-tuned Qwen-VL approaches the performance of strong closed-source models on several spatial reasoning tasks.
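
For readers unfamiliar with the geometry metric, the following sketch shows one common form of Chamfer Distance and how a relative reduction such as the reported 17% would be computed from two scores. Conventions differ between papers (squared versus unsquared distances, sum versus mean of the two directions), and the abstract does not say which one ReScene uses, so the definition here is an assumption. The point sets are toy data, not results from the paper.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance a->b plus b->a.

    Brute force, O(len(a) * len(b)); fine for a toy example, too slow for real scans.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def relative_reduction(baseline, ours):
    """Fractional improvement of `ours` over `baseline` for a lower-is-better metric."""
    return (baseline - ours) / baseline


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
baseline_pts = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]    # toy reconstruction, 0.5 off in z
ours_pts = [(0.0, 0.0, 0.25), (1.0, 0.0, 0.25)]      # toy reconstruction, 0.25 off in z

cd_baseline = chamfer_distance(baseline_pts, ground_truth)
cd_ours = chamfer_distance(ours_pts, ground_truth)
print(cd_baseline, cd_ours)                      # 1.0 0.5
print(relative_reduction(cd_baseline, cd_ours))  # 0.5
```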

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ReScene, a framework for structured indoor scene reconstruction from multi-view captures aimed at producing simulation-ready scenes for Embodied AI. It proposes two components: HierView, which prioritizes views based on semantic consistency and 3D coverage instead of largest-mask heuristics, and Relation-Aware Assembly, which fuses multi-frame VLM relation predictions with geometric and room-shell priors into a confidence-weighted scene graph for physically consistent object assemblies. The work claims SOTA results on ScanNet scenes (17% Chamfer Distance reduction, 26% LPIPS improvement, up to 10x faster than prior multi-view methods) and generates an embodied VQA dataset where fine-tuned Qwen-VL approaches closed-source model performance on spatial tasks.

Significance. If the quantitative gains and physical consistency claims are substantiated, the work would meaningfully advance multi-view scene reconstruction by threading geometry as a prior throughout and leveraging VLM for explicit relations, addressing a bottleneck for downstream Embodied AI tasks. The view prioritization and scene-graph assembly ideas have potential for broader adoption if validated.

major comments (2)
  1. [Abstract] Abstract: The headline quantitative claims (17% Chamfer Distance reduction, 26% LPIPS improvement, 10x speed-up) are presented without any evaluation protocol, baseline details, dataset splits, error bars, or statistical tests. This directly undermines assessment of the SOTA assertion, which is load-bearing for the paper's contribution.
  2. [Abstract] Abstract (Relation-Aware Assembly description): The central claim that the method produces physically consistent assemblies via VLM-prior fusion lacks any quantitative validation (e.g., penetration rates, gravity stability, or collision metrics). This is load-bearing because the paper positions physical plausibility as the key differentiator from prior geometrically reasonable but inconsistent layouts.
minor comments (1)
  1. [Abstract] Abstract: The embodied VQA dataset is mentioned as a contribution but without any details on construction, size, task definitions, or evaluation splits.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the headline claims require better contextualization and will revise accordingly. Point-by-point responses to the major comments follow.

  1. Referee: [Abstract] Abstract: The headline quantitative claims (17% Chamfer Distance reduction, 26% LPIPS improvement, 10x speed-up) are presented without any evaluation protocol, baseline details, dataset splits, error bars, or statistical tests. This directly undermines assessment of the SOTA assertion, which is load-bearing for the paper's contribution.

    Authors: The full manuscript details the evaluation protocol, baselines, ScanNet splits, and metrics (including error bars from repeated runs and statistical tests) in Section 4 and the supplementary material. The abstract is length-constrained, but we will revise it to briefly reference the evaluation setting (e.g., "on ScanNet validation scenes") to improve self-containment while preserving conciseness. revision: partial

  2. Referee: [Abstract] Abstract (Relation-Aware Assembly description): The central claim that the method produces physically consistent assemblies via VLM-prior fusion lacks any quantitative validation (e.g., penetration rates, gravity stability, or collision metrics). This is load-bearing because the paper positions physical plausibility as the key differentiator from prior geometrically reasonable but inconsistent layouts.

    Authors: We agree that the current manuscript lacks direct quantitative physical-consistency metrics such as penetration or collision rates. Physical plausibility is achieved through the geometric and room-shell priors in Relation-Aware Assembly and is supported indirectly by the improved geometry/perceptual metrics and embodied VQA results. We will add explicit quantitative physical validation (e.g., simulation-based collision and stability metrics) to the experiments section in revision. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is methodological with no derivational reductions

The paper describes an engineering pipeline (HierView view prioritization plus Relation-Aware Assembly via VLM and geometric priors) evaluated empirically on ScanNet. No equations, parameter-fitting steps, or first-principles derivations appear in the abstract or described structure. There are no self-definitional loops, no fitted inputs renamed as predictions, and no load-bearing self-citations that reduce the central claims to their own inputs by construction. The physical-consistency claim is an empirical assertion (not a derived theorem), and the reader's note confirms the absence of equations. This is the common honest case of a self-contained applied method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no parameter counts, and no explicit assumptions beyond the high-level claim that multi-view geometry acts as a unifying prior; therefore the ledger is empty.

pith-pipeline@v0.9.1-grok · 5799 in / 1181 out tokens · 33106 ms · 2026-06-29T04:27:39.365309+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 29 canonical work pages · 9 internal anchors

  1. [1]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022. URL https: //arxiv.org/abs/1712.05474

  2. [2]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InInternational Conference on Computer Vision (ICCV), pages 9339–9347, 2019

  3. [3]

    Procthor: Large- scale embodied ai using procedural generation, 2022

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large- scale embodied ai using procedural generation, 2022. URL https://arxiv.org/abs/2206. 06994

  4. [4]

    3D-FRONT: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3d furnished rooms with layouts and semantics. InInternational Conference on Computer Vision (ICCV), pages 10933–10942, 2021

  5. [5]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023

  6. [6]

    Metascenes: Towards automated replica creation for real-world 3d scans, 2025

    Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, and Siyuan Huang. Metascenes: Towards automated replica creation for real-world 3d scans, 2025. URL https://arxiv.org/abs/ 2505.02388

  7. [7]

    LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025

    Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, and Joan Lasenby. LiteReality: Graphics-ready 3d scene reconstruction from rgb-d scans, 2025. URLhttps://arxiv.org/abs/2507.02861

  8. [8]

    Gpt4scene: Understand 3d scenes from videos with vision-language models,

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3d scenes from videos with vision-language models, 2025. URL https://arxiv. org/abs/2501.01428

  9. [9]

    SimRecon: Simready compositional scene reconstruction from real videos, 2026

    Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, and Yueqi Duan. SimRecon: Simready compositional scene reconstruction from real videos, 2026. URL https://arxiv. org/abs/2603.02133

  10. [10]

    RICO: Regularizing the unobservable for indoor compositional reconstruction

    Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng Wang, Yiyi Liao, and Yong Liu. RICO: Regularizing the unobservable for indoor compositional reconstruction. InInternational Conference on Computer Vision (ICCV), 2023

  11. [11]

    Decompositional neural scene reconstruction with generative diffusion prior, 2025

    Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, and Siyuan Huang. Decompositional neural scene reconstruction with generative diffusion prior, 2025. URL https://arxiv.org/abs/2503.14830

  12. [12]

    Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025

    Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, and Hujun Bao. Instascene: Towards complete 3d instance decomposition and reconstruc- tion from cluttered scenes, 2025. URLhttps://arxiv.org/abs/2507.08416

  13. [13]

    Embodied Question Answering

    Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering, 2017. URLhttps://arxiv.org/abs/1711.11543

  14. [14]

    ScanQA: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3d question answering for spatial scene understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 10

  15. [15]

    SQA3D: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3d scenes. InInternational Conference on Learning Representations (ICLR), 2023

  16. [16]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

  17. [17]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), pages 405–421, 2020

  18. [18]

    MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction

    Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  19. [19]

    Generalizable 3d scene reconstruction via divide and conquer from a single view

    Andreea Ardelean, Mert Özer, and Bernhard Egger. Generalizable 3d scene reconstruction via divide and conquer from a single view. InInternational Conference on 3D Vision (3DV), 2025

  20. [20]

    Midi: Multi-instance diffusion for single image to 3d scene generation, 2025

    Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. Midi: Multi-instance diffusion for single image to 3d scene generation, 2025. URLhttps://arxiv.org/abs/2412.03558

  21. [21]

    Scenegen: Single-image 3d scene generation in one feedforward pass, 2025

    Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie. Scenegen: Single-image 3d scene generation in one feedforward pass, 2025. URLhttps://arxiv.org/abs/2508.15769

  22. [22]

    Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans

    Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans. Graph., 44(4), July 2025. ISSN 0730-0301. doi: 10.1145/3730841. URL https://doi.org/10.1145/3730841

  23. [23]

    DeepSDF: Learning continuous signed distance functions for shape representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  24. [24]

    Barron, and Ben Mildenhall

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. InInternational Conference on Learning Representations (ICLR), 2023

  25. [25]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  26. [26]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

  27. [27]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  28. [28]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

  29. [29]

    Objaverse-XL: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Ujval Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A universe of 10m+ 3d objects. InAdvances in Neural Information Processin...

  30. [30]

    3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FUTURE: 3d furniture shape with textures.Interna- tional Journal of Computer Vision, 2021

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), 2023. 11

  32. [32]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets, 2024. URLhttps://arxiv.org/abs/2406.13897

  33. [33]

    TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, and Yan-Pei Cao. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models, 2025. URL https://arxiv.org/ abs/2502.06608

  34. [34]

    Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer, 2024. URL https://arxiv.org/abs/2405.14832

  35. [35]

    Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2025

    Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, Lifu Wang, Jing Xu, Zebin He, Zhuo Chen, Sicong Liu, Junta Wu, Yihang Lian, Shaoxiong Yang, Yuhong Liu, Yong Yang, Di Wang, Jie Jiang, and Chunchao Guo. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generatio...

  36. [36]

    Structured 3d latents for scalable and versatile 3d generation,

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation,

  37. [37]

    URLhttps://arxiv.org/abs/2412.01506

  38. [38]

    Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025

    Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner, 2025. URLhttps://arxiv.org/abs/2405.14979

  39. [39]

    Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023

    Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation, 2023. URLhttps://arxiv.org/abs/2306.17115

  40. [40]

    Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025

    Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging, 2025. URLhttps://arxiv.org/abs/2503.22236

  41. [41]

    SAM 3D: 3Dfy Anything in Images

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. SAM 3D: 3dfy anything in images, 2025. URLht...

  42. [42]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  43. [43]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. InInternational Conference on 3D Vision (3DV), 2017

  44. [44]

    ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  45. [45]

    ScanNet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3d indoor scenes. InInternational Conference on Computer Vision (ICCV), 2023

  46. [46]

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yuan Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke Strasdat...

  47. [47]

    SceneNN: A scene meshes dataset with annotations

    Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. InInternational Conference on 3D Vision (3DV), 2016

  48. [48]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv. org/abs/...

  49. [49]

    Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,

    Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action, 2022. URL https://arxiv.org/ abs/2207.04429

  50. [50]

    Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei

    Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments, 2021. URL https://arxiv.org/abs/ 2108.03332

  51. [51]

    arXiv preprint arXiv:2310.13724 (2023) 3

    Xavier Puig, Eric Undersander, Andrew Szot, Marc-Alexandre Cote, Tsung-Yen Yang, Rus- lan Partsey, Rutav Desai, Alexander Clegg, Michal Hlavac, So Yeon Min, Viktor V ondrus, Theophile Gervet, Vincent-Pierre Berges, John Turner, Oleksandr Maksymets, Zsolt Kira, Mri- nal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara ...

  52. [52]

    Behavior vision suite: Customizable dataset generation via simulation, 2024

    Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, and Jiajun Wu. Behavior vision suite: Custo...

  53. [53]

    ATISS: Autoregressive transformers for indoor scene synthesis

    Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregressive transformers for indoor scene synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  54. [54]

    Holodeck: Language guided generation of 3d embodied ai environments, 2024

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied ai environments, 2024. URLhttps://arxiv.org/abs/2312.09067

  55. [55]

    DiffuScene: Denoising diffusion models for generative indoor scene synthesis

    Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. DiffuScene: Denoising diffusion models for generative indoor scene synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  56. [56]

    Panst3r: Multi-view consistent panoptic segmentation, 2025

    Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, and Gabriela Csurka. Panst3r: Multi-view consistent panoptic segmentation, 2025. URL https://arxiv.org/abs/2506.21348

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInterna- tional Conference on Machine Learning (ICML), 2021

  58. [58]

    Fast point feature histograms (FPFH) for 3d registration

    Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3d registration. InIEEE International Conference on Robotics and Automation (ICRA), 2009

  59. [59]

    Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991

    Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991. 13

  60. [60]

    Learning 3d semantic scene graphs from 3d indoor reconstructions

    Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  61. [61]

    Hydra: A real-time spatial perception system for 3d scene graph construction and optimization

    Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. In Robotics: Science and Systems (RSS), 2022

  62. [62]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  63. [63]

    DRAWER: Digital reconstruction and articulation with environment realism

    Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, and Wei-Chiu Ma. DRAWER: Digital reconstruction and articulation with environment realism. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21771–21782, 2025.

A Evaluation Protocol

A.1 Baseline In...

  1. Panoptic multi-view reconstruction. We run PanSt3R to obtain camera-aware multi-view geometry, instance masks, per-instance labels, and semantic point clouds in a shared world coordinate frame.

  2. HierView selection. For each instance, we select one reconstruction view using the visibility, semantic, and 3D-completeness filters described in Sec. 3.2.

  3. Single-view asset generation. The selected image and mask are passed to SAM3D-Objects to reconstruct an initial per-instance mesh.

  4. Instance registration. Each generated mesh is aligned to its corresponding semantic point cloud using bounded Sim(3) registration.

  5. Scene graph construction. We select key frames by greedy 3D coverage, render ID-overlaid images, infer per-frame relations with a VLM, and merge them into a global scene graph.

  6. Attachment and refinement. We compile relation edges into geometric constraints, run the attachment solver, optionally run a floor-only branch for dual attachment, and apply staged post-refinement and depenetration.

All stages share the same instance IDs, so the selected view, generated mesh, registration target, scene graph node, and final attachment tran...
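The key-frame step in scene graph construction ("greedy 3D coverage") can be illustrated with a small sketch. This is not the authors' implementation: it assumes each frame is summarized by the set of scene point IDs it observes, and the names `greedy_coverage_keyframes`, `visible`, and `max_frames` are hypothetical.

```python
from typing import Dict, List, Set


def greedy_coverage_keyframes(
    visible: Dict[str, Set[int]], max_frames: int
) -> List[str]:
    """Pick frames one at a time, each adding the most not-yet-covered points.

    `visible` maps a frame ID to the set of 3D point IDs seen in that frame
    (an assumed representation, chosen only for illustration).
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(visible)
    while remaining and len(chosen) < max_frames:
        # Iterate over sorted IDs so that ties resolve deterministically.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # no remaining frame adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Toy example: four frames observing overlapping subsets of nine points.
frames = {
    "f0": {1, 2, 3, 4},
    "f1": {3, 4, 5},
    "f2": {5, 6, 7, 8, 9},
    "f3": {1, 2},
}
keyframes = greedy_coverage_keyframes(frames, max_frames=3)
```

The loop stops early once every remaining frame is redundant, so fewer than `max_frames` frames may be returned.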

  1. Build the candidate view set from frames where the instance has a non-empty mask.

  2. Remove candidates with mask area below a_min or bounding-box area ratio below b_min.

  3. Encode the masked crop and the category text with CLIP, then keep views whose semantic score is above the threshold or whose semantic rank is sufficiently high.

  4. Project a capped set of instance points into each remaining view and compute the fraction landing inside the mask.

  5. Select the view with the highest 3D-to-2D completeness score.

If a stage removes all candidates, the implementation falls back to the best candidate from the previous stage rather than dropping the instance immediately.

Table A2: Default HierView hyperparameters.

    Parameter                             Value
    CLIP backend/model                    ViT-B/32
    Semantic probability threshold τ_s    0.12
    Semantic r...
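The staged selection above can be sketched as follows. This is a minimal illustration, not the released implementation: the `View` record, the helper names, and the default values of `a_min`, `b_min`, and `top_k` are assumptions; only the threshold 0.12 comes from Table A2. Semantic scores are assumed to be precomputed (for example by CLIP), and the fallback is interpreted here as "keep the single best candidate that entered the stage".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class View:
    frame_id: str
    mask_area: int       # pixels in the instance mask
    bbox_ratio: float    # bounding-box area / image area
    semantic: float      # precomputed semantic score for the category text
    completeness: float  # fraction of projected 3D points inside the mask


def completeness_score(
    pixels: Sequence[Tuple[int, int]], mask: Sequence[Sequence[bool]]
) -> float:
    """Fraction of projected (u, v) pixels that land inside a boolean mask."""
    if not pixels:
        return 0.0
    h, w = len(mask), len(mask[0])
    hits = sum(1 for u, v in pixels if 0 <= v < h and 0 <= u < w and mask[v][u])
    return hits / len(pixels)


def filter_with_fallback(
    views: List[View],
    keep: Callable[[View], bool],
    score: Callable[[View], float],
) -> List[View]:
    """Apply a filter; if it empties the set, keep the best incoming view."""
    kept = [v for v in views if keep(v)]
    return kept if kept else [max(views, key=score)]


def select_view(
    views: List[View],
    a_min: int = 400,      # hypothetical minimum mask area
    b_min: float = 0.01,   # hypothetical minimum bounding-box area ratio
    tau_s: float = 0.12,   # semantic threshold from Table A2
    top_k: int = 1,        # hypothetical rank cutoff
) -> Optional[View]:
    # Candidate set: frames with a non-empty mask.
    views = [v for v in views if v.mask_area > 0]
    if not views:
        return None
    # Visibility filter.
    views = filter_with_fallback(
        views,
        lambda v: v.mask_area >= a_min and v.bbox_ratio >= b_min,
        lambda v: float(v.mask_area),
    )
    # Semantic filter: above the threshold, or ranked within the top_k.
    ranked = sorted(views, key=lambda v: v.semantic, reverse=True)
    top = {v.frame_id for v in ranked[:top_k]}
    views = filter_with_fallback(
        views,
        lambda v: v.semantic >= tau_s or v.frame_id in top,
        lambda v: v.semantic,
    )
    # Final choice: highest 3D-to-2D completeness.
    return max(views, key=lambda v: v.completeness)


candidates = [
    View("f0", 50, 0.001, 0.90, 0.95),   # fails the visibility filter
    View("f1", 900, 0.05, 0.30, 0.60),
    View("f2", 1200, 0.08, 0.05, 0.99),  # fails the semantic filter
    View("f3", 800, 0.04, 0.20, 0.85),
]
best = select_view(candidates)
```

In this toy input the largest mask (`f2`) is rejected on semantic grounds, which is the kind of case where the staged filters and a largest-mask heuristic disagree.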

  1. Floor snap: snap floor-supported objects to the floor plane and project footprints back into the floor boundary if needed.

  2. Wall resolution: resolve the canonical wall root to a concrete wall plane by distance and orientation agreement.

  3. Wall attachment: align wall-mounted objects to the selected wall plane while preserving lateral order along the wall.

  4. Object support: project child contact surfaces onto feasible support faces of parent objects.

  5. Near-wall snap: optionally snap large furniture near walls when the scene graph and geometry support it.

  6. Outside repair: move objects whose footprints leave the room boundary back toward feasible interior positions.
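Two of the refinement stages listed above, floor snap and outside repair, can be illustrated on axis-aligned boxes. The sketch below makes strong simplifying assumptions that are not from the paper: objects are axis-aligned bounding boxes with z up, the floor is a horizontal plane, and the room boundary is an axis-aligned rectangle. All names (`Box`, `floor_snap`, `outside_repair`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; (x, y) is the floor footprint and z points up."""
    xmin: float
    ymin: float
    zmin: float
    xmax: float
    ymax: float
    zmax: float


def translate(b: Box, dx: float = 0.0, dy: float = 0.0, dz: float = 0.0) -> Box:
    return Box(b.xmin + dx, b.ymin + dy, b.zmin + dz,
               b.xmax + dx, b.ymax + dy, b.zmax + dz)


def floor_snap(b: Box, floor_z: float = 0.0) -> Box:
    """Translate vertically so the bottom face rests on the floor plane."""
    return translate(b, dz=floor_z - b.zmin)


def _shift_into(lo: float, hi: float, room_lo: float, room_hi: float) -> float:
    """Smallest 1D shift that brings [lo, hi] inside [room_lo, room_hi]."""
    if hi - lo > room_hi - room_lo:
        # The object is wider than the room along this axis: center it.
        return (room_lo + room_hi) / 2 - (lo + hi) / 2
    if lo < room_lo:
        return room_lo - lo
    if hi > room_hi:
        return room_hi - hi
    return 0.0


def outside_repair(b: Box, room: Tuple[float, float, float, float]) -> Box:
    """Move the footprint back inside a rectangular room (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = room
    return translate(
        b,
        dx=_shift_into(b.xmin, b.xmax, x0, x1),
        dy=_shift_into(b.ymin, b.ymax, y0, y1),
    )


# Toy example: a chair floating above the floor and poking through two walls.
room = (0.0, 0.0, 4.0, 3.0)
chair = Box(3.5, -0.25, 0.125, 4.25, 0.5, 1.0)
repaired = outside_repair(floor_snap(chair), room)
```

A real room shell is generally a polygon and objects are oriented, so a full implementation would project onto the polygon boundary rather than clamp per axis; the sketch only shows the minimal-translation idea.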
