Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Pith reviewed 2026-05-13 22:13 UTC · model grok-4.3
The pith
Unlabeled internet videos can be automatically converted into training data for 3D scene understanding models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data for end-to-end 3D scene understanding models, which achieve strong zero-shot performance across low-level perception and high-level reasoning tasks and improve further after finetuning.
What carries the argument
The automated data generation engine that extracts 3D training signals and annotations from unlabeled videos by addressing identified bottlenecks in the lifting process.
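The excerpt names the engine's role but not its components. As a concrete reading, the stages the referee later identifies (structure-from-motion, monocular depth, pseudo-labeling) could be composed roughly as below; every name and interface here is a hypothetical sketch, not the paper's actual design.

```python
# Hypothetical sketch of a video-to-3D lifting pipeline. Only the stage names
# (SfM, monocular depth, pseudo-labeling) come from the review; the dataclass,
# function signatures, and the clip-filtering heuristic are assumptions.
from dataclasses import dataclass, field

@dataclass
class LiftedScene:
    """3D training signals recovered from one unlabeled video clip."""
    camera_poses: list = field(default_factory=list)   # per-frame extrinsics
    depth_maps: list = field(default_factory=list)     # per-frame metric depth
    pseudo_labels: list = field(default_factory=list)  # lifted 3D boxes/masks

def lift_video(frames, sfm, depth_model, labeler, min_track_ratio=0.5):
    """Run the three stages; discard clips where SfM registers too few
    frames, since broken reconstruction is one of the noted bottlenecks."""
    poses = sfm(frames)                        # structure-from-motion
    if len(poses) < min_track_ratio * len(frames):
        return None                            # drop low-quality clips
    depths = [depth_model(f) for f in frames]  # monocular depth per frame
    labels = labeler(frames, poses, depths)    # lift 2D detections to 3D
    return LiftedScene(poses, depths, labels)
```

The filtering step makes the "critical factors" claim concrete: which clips survive, and under what thresholds, is exactly the kind of bottleneck the paper says it analyzes.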
If this is right
- 3D scene models can train effectively using only data lifted from internet videos without human labels for initial performance.
- The same generated data supports both low-level tasks like object detection and high-level tasks like spatial VQA and navigation.
- Combining the generated data with small amounts of human annotations yields further accuracy gains through finetuning.
- Training data volume can scale directly with the amount of available unlabeled online video content.
Where Pith is reading between the lines
- The approach could extend to streaming internet video for ongoing model updates without new annotation campaigns.
- Similar lifting from unlabeled sources might apply to other data-scarce areas such as 3D human interaction or robotics simulation.
- If generation quality improves, full replacement of manual 3D annotation becomes feasible for many downstream applications.
Load-bearing premise
The automated process from unlabeled videos produces sufficiently clean and diverse 3D training signals without introducing harmful biases or noise.
What would settle it
The claim would be refuted if models trained only on the generated data showed no zero-shot improvement over random baselines, and no gains after finetuning, on standard 3D benchmarks for detection, segmentation, VQA, or navigation.
Original abstract
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-evel reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a data engine that automatically lifts 3D annotations from unlabeled internet videos to train models for 3D object detection, instance segmentation, spatial VQA, and VLN. It claims to identify and analyze bottlenecks in the generation process and reports that models trained on the generated data achieve strong zero-shot performance that further improves after finetuning on human-annotated data.
Significance. If the generated labels prove sufficiently clean and diverse, the work would be significant for scaling 3D scene understanding without relying solely on expensive manual annotations, by demonstrating a viable path from abundant web video to usable training signals across perception and reasoning tasks.
Major comments (2)
- Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains.
- Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.
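The fidelity metrics these comments ask for are standard and easy to state precisely. A minimal illustration of two of them (depth RMSE and pseudo-label precision) follows; the metric names come from the report, while the data layout and the 0.25 IoU threshold are assumptions for the sketch.

```python
# Illustrative implementations of two fidelity metrics the report requests.
# The IoU function is passed in because 3D IoU depends on the box
# parameterization, which the excerpt does not specify.
import math

def depth_rmse(pred, gt):
    """Root-mean-square error between predicted and reference depth values."""
    assert len(pred) == len(gt) and pred
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

def pseudo_label_precision(pred_boxes, gt_boxes, iou_fn, thresh=0.25):
    """Fraction of generated 3D boxes matching some reference box at
    IoU >= thresh; low precision would mean noisy pseudo-labels."""
    if not pred_boxes:
        return 0.0
    hits = sum(1 for p in pred_boxes
               if any(iou_fn(p, g) >= thresh for g in gt_boxes))
    return hits / len(pred_boxes)
```

Reporting numbers like these on a held-out subset with ground-truth 3D data would directly address the concern that reconstruction noise, rather than genuine signal, drives the downstream gains.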
Minor comments (1)
- Abstract: 'high-evel' is a typo and should read 'high-level'; 'Vision-Lanugage' should read 'Vision-Language'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract would benefit from explicit quantitative support and that additional direct assessments of generated label quality would strengthen the presentation. We address each major comment below and outline the revisions we will make.
Point-by-point responses
Referee: Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains.
Authors: We agree that the abstract currently summarizes the claims at a high level without embedding specific numbers. The full manuscript reports quantitative zero-shot and finetuned results across 3D detection, instance segmentation, spatial VQA, and VLN, along with bottleneck analysis. In the revision we will update the abstract to include key performance metrics (e.g., mAP on detection, accuracy on VQA) that support the zero-shot claim. We will also add a concise statement on label-fidelity analysis drawn from our bottleneck study. Revision: yes.
Referee: Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.
Authors: We partially agree. The experiments evaluate the generated data via downstream task performance and explicit bottleneck analysis that identifies which generation steps most affect final accuracy; this provides indirect but task-relevant evidence of signal quality. However, we acknowledge that direct metrics against reference 3D annotations (3D IoU, depth RMSE, pseudo-label precision) are not reported. In the revision we will add a dedicated paragraph and table on a held-out subset with available ground-truth 3D data to quantify these fidelity metrics and address the noise concern. Revision: partial.
Circularity Check
No significant circularity: empirical results rest on external validation
Full rationale
The paper describes an empirical pipeline for lifting 3D annotations from unlabeled web videos and reports downstream task performance (zero-shot and fine-tuned) on detection, segmentation, VQA, and VLN. No equations, parameter fittings, uniqueness theorems, or derivations appear in the provided text. All central claims are supported by experimental comparisons against human-annotated baselines rather than by any self-referential reduction of outputs to inputs. The absence of mathematical structure precludes the self-definitional, fitted-input, or self-citation-load-bearing patterns required for a positive circularity finding.