pith. machine review for the scientific record.

arxiv: 2604.01907 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Lifting Unlabeled Internet-level Data for 3D Scene Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D scene understanding · unlabeled videos · data generation · zero-shot performance · 3D object detection · visual question answering · vision-language navigation

The pith

Unlabeled internet videos can be automatically converted into training data for 3D scene understanding models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that data engines can process abundant unlabeled web videos to create training signals for 3D scene models, reducing dependence on scarce human annotations. It identifies bottlenecks in this automated generation and tests the resulting data on tasks from 3D object detection and instance segmentation to spatial visual question answering and vision-language navigation. Models trained on the generated data alone achieve strong zero-shot performance and improve further when fine-tuned with limited labeled examples. This establishes a practical route to scaling 3D perception using existing online video resources.

Core claim

Carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data that facilitates end-to-end models in 3D scene understanding, delivering strong zero-shot performance across low-level perception and high-level reasoning tasks with further gains after finetuning.

What carries the argument

The automated data generation engine that extracts 3D training signals and annotations from unlabeled videos by addressing identified bottlenecks in the lifting process.
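To make that machinery concrete, here is a minimal sketch of the kind of lifting step such an engine must perform: back-projecting a 2D instance mask into a world-frame 3D box using per-frame depth and an SfM camera pose. The function name, depth range, and percentile trick are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of one lifting step in a video-to-3D data engine (not the authors' code).
import numpy as np

def lift_mask_to_box(mask, depth, K, cam_to_world):
    """Turn one 2D instance mask into an axis-aligned 3D box in world coordinates.

    mask:  (H, W) bool array from a 2D segmenter.
    depth: (H, W) metric depth in meters (e.g. from a monocular depth model).
    K:     (3, 3) camera intrinsics.
    cam_to_world: (4, 4) camera pose, e.g. recovered by SfM.
    """
    v, u = np.nonzero(mask)                      # pixel coordinates inside the mask
    z = depth[v, u]
    valid = (z > 0.1) & (z < 10.0)               # drop implausible depths (assumed indoor range)
    u, v, z = u[valid], v[valid], z[valid]

    # Back-project pixels to camera-frame 3D points.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # Move the points into the world frame using the SfM pose.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]

    # Robust axis-aligned box: percentiles tolerate stray depth outliers.
    lo = np.percentile(pts_world, 2, axis=0)
    hi = np.percentile(pts_world, 98, axis=0)
    return {"center": (lo + hi) / 2, "size": hi - lo}
```

Boxes lifted this way, aggregated and de-duplicated across frames, are the kind of pseudo-annotation an engine of this sort can feed to a 3D detector or segmenter.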

If this is right

  • 3D scene models can reach useful initial performance when trained only on data lifted from internet videos, without human labels.
  • The same generated data supports both low-level tasks like object detection and high-level tasks like spatial VQA and navigation.
  • Combining the generated data with small amounts of human annotations yields further accuracy gains through finetuning (see the sketch after this list).
  • Training data volume can scale directly with the amount of available unlabeled online video content.
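As a rough illustration of the last two points, here is a hedged sketch of mixing a large engine-generated set with a small human-labeled set during finetuning. The stand-in datasets and the 10x upweighting of scarce labels are assumptions for illustration, not values from the paper.

```python
# Minimal sketch of mixed finetuning on generated + scarce labeled data (assumed recipe).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for the two sources: a large engine-generated set and a small labeled set.
generated = TensorDataset(torch.randn(10_000, 8), torch.zeros(10_000, dtype=torch.long))
labeled = TensorDataset(torch.randn(500, 8), torch.ones(500, dtype=torch.long))

mixed = ConcatDataset([generated, labeled])

# Upweight each human-labeled sample so the scarce labels are not drowned out.
weights = torch.cat([torch.full((len(generated),), 1.0), torch.full((len(labeled),), 10.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=32, sampler=sampler)

for features, labels in loader:
    pass  # one finetuning step per batch would go here
```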

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend to streaming internet video for ongoing model updates without new annotation campaigns.
  • Similar lifting from unlabeled sources might apply to other data-scarce areas such as 3D human interaction or robotics simulation.
  • If generation quality improves, full replacement of manual 3D annotation becomes feasible for many downstream applications.

Load-bearing premise

The automated process from unlabeled videos produces sufficiently clean and diverse 3D training signals without introducing harmful biases or noise.

What would settle it

The claim would be undercut if models trained only on the generated data showed no zero-shot improvement over random baselines, or no gains after finetuning, on standard 3D benchmarks for detection, segmentation, VQA, or navigation.
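A minimal sketch of what that check could look like on a multiple-choice spatial-VQA benchmark, assuming a four-option format; the counts and the 3-sigma threshold are illustrative, not drawn from the paper.

```python
# Hedged sketch: does generated-data-only training beat the chance baseline by more than noise?
import numpy as np

def beats_chance(correct, n_questions, n_options=4, z_threshold=3.0):
    """Return accuracy, chance rate, and whether accuracy exceeds chance at ~3 sigma."""
    acc = correct / n_questions
    chance = 1.0 / n_options
    se = np.sqrt(chance * (1 - chance) / n_questions)   # std. error of the chance rate
    return acc, chance, (acc - chance) / se > z_threshold

acc, chance, passes = beats_chance(correct=1470, n_questions=4000)
print(f"zero-shot acc {acc:.3f} vs chance {chance:.3f} -> beats chance: {passes}")
```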

Figures

Figures reproduced from arXiv: 2604.01907 by Baoxiong Jia, Hongyu Shen, Huangyue Yu, Jiangyong Huang, Junchao He, Junfeng Ni, Shaofei Wang, Siyuan Huang, Song-Chun Zhu, Yan Wang, Yaowei Zhang, Yixin Chen.

Figure 1
Figure 1. Overview of SceneVerse++. From unlabeled internet videos, we build automated data engines to create training data for comprehensive 3D scene understanding, realizing strong zero-shot performance on existing benchmarks, with further improvement after finetuning. This pinpoints future direction towards 3D spatial intelligence through improved automation on unlabeled, web-scale data. view at source ↗
Figure 2
Figure 2. Statistics comparison. SceneVerse++ encompasses more scenes, larger areas, and greater object diversity compared with existing real-world datasets. view at source ↗
Figure 3
Figure 3. Overview of data generation. The pipeline leverages a modular design for automatic 3D reconstruction and segmentation. view at source ↗
Figure 4
Figure 4. Reconstruction and segmentation comparison, where SceneVerse++ features a balance in quality and efficiency. view at source ↗
Figure 5
Figure 5. Training dynamics. view at source ↗
Figure 6
Figure 6. Overview of the VLN data generation pipeline. We construct VLN data from room-tour videos by (i) preprocessing trajectories to eliminate redundant local rotations and segmenting long paths into sub-paths suitable for instruction generation; (ii) converting camera transitions within each sub-path into R2R-style navigation actions; and (iii) generating instructions for each sub-path using VLMs. view at source ↗
Figure 7
Figure 7. Trajectory comparison. Top: Room-tour videos show irregular and redundant camera motions. Middle: R2R trajectories are smooth and goal-directed. Bottom: raw videos are converted into VLN-compatible data. Different colors indicate sub-paths. view at source ↗
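The pipeline in Figure 6 hinges on step (ii): turning camera motion into discrete navigation actions. Below is a hedged sketch of that conversion; the 30-degree turn and 0.25 m step thresholds are illustrative assumptions, and the turn sign convention depends on the coordinate frame. This is not the authors' code.

```python
# Hedged sketch: converting consecutive camera poses into R2R-style discrete actions.
import numpy as np

def poses_to_actions(positions, headings, turn_thresh_deg=30.0, step_thresh_m=0.25):
    """positions: (N, 3) camera centers; headings: (N,) yaw in radians."""
    actions = []
    for i in range(1, len(positions)):
        # Wrap the yaw difference to [-180, 180] degrees.
        dyaw = np.degrees(np.arctan2(np.sin(headings[i] - headings[i - 1]),
                                     np.cos(headings[i] - headings[i - 1])))
        step = np.linalg.norm(positions[i] - positions[i - 1])
        if dyaw > turn_thresh_deg:
            actions.append("TURN_LEFT")
        elif dyaw < -turn_thresh_deg:
            actions.append("TURN_RIGHT")
        elif step > step_thresh_m:
            actions.append("FORWARD")
        # Small motions are treated as redundant and dropped, mirroring the
        # trajectory cleanup described in the Figure 7 caption.
    actions.append("STOP")
    return actions
```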
read the original abstract

Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-evel reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a data engine that automatically lifts 3D annotations from unlabeled internet videos to train models for 3D object detection, instance segmentation, spatial VQA, and VLN. It claims to identify and analyze bottlenecks in the generation process and reports that models trained on the generated data achieve strong zero-shot performance that further improves after finetuning on human-annotated data.

Significance. If the generated labels prove sufficiently clean and diverse, the work would be significant for scaling 3D scene understanding without relying solely on expensive manual annotations, by demonstrating a viable path from abundant web video to usable training signals across perception and reasoning tasks.

major comments (2)
  1. Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains (a sketch of such fidelity metrics follows this report).
  2. Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.
minor comments (1)
  1. Abstract: 'high-evel' is a typo and should read 'high-level'; 'Vision-Lanugage' should read 'Vision-Language'.
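For concreteness, a simplified sketch of the fidelity metrics named in major comment 1: axis-aligned 3D IoU between a lifted box and a reference box, and pseudo-label precision at an IoU threshold. The box format and threshold are assumptions for illustration, not the paper's evaluation code.

```python
# Hedged sketch of generated-label fidelity metrics (axis-aligned boxes only).
import numpy as np

def iou_3d(box_a, box_b):
    """Boxes as (center(3), size(3)) tuples of numpy arrays."""
    lo_a, hi_a = box_a[0] - box_a[1] / 2, box_a[0] + box_a[1] / 2
    lo_b, hi_b = box_b[0] - box_b[1] / 2, box_b[0] + box_b[1] / 2
    inter = np.prod(np.clip(np.minimum(hi_a, hi_b) - np.maximum(lo_a, lo_b), 0, None))
    union = np.prod(hi_a - lo_a) + np.prod(hi_b - lo_b) - inter
    return inter / union

def pseudo_label_precision(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Fraction of generated boxes that match some reference box above the threshold."""
    hits = sum(any(iou_3d(p, g) >= iou_thresh for g in gt_boxes) for p in pred_boxes)
    return hits / max(len(pred_boxes), 1)
```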

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract would benefit from explicit quantitative support and that additional direct assessments of generated label quality would strengthen the presentation. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains.

    Authors: We agree that the abstract currently summarizes the claims at a high level without embedding specific numbers. The full manuscript reports quantitative zero-shot and finetuned results across 3D detection, instance segmentation, spatial VQA, and VLN, along with bottleneck analysis. In the revision we will update the abstract to include key performance metrics (e.g., mAP on detection, accuracy on VQA) that support the zero-shot claim. We will also add a concise statement on label-fidelity analysis drawn from our bottleneck study. revision: yes

  2. Referee: Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.

    Authors: We partially agree. The experiments evaluate the generated data via downstream task performance and explicit bottleneck analysis that identifies which generation steps most affect final accuracy; this provides indirect but task-relevant evidence of signal quality. However, we acknowledge that direct metrics against reference 3D annotations (3D IoU, depth RMSE, pseudo-label precision) are not reported. In the revision we will add a dedicated paragraph and table on a held-out subset with available ground-truth 3D data to quantify these fidelity metrics and address the noise concern. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical results rest on external validation

full rationale

The paper describes an empirical pipeline for lifting 3D annotations from unlabeled web videos and reports downstream task performance (zero-shot and fine-tuned) on detection, segmentation, VQA, and VLN. No equations, parameter fittings, uniqueness theorems, or derivations appear in the provided text. All central claims are supported by experimental comparisons against human-annotated baselines rather than by any self-referential reduction of outputs to inputs. The absence of mathematical structure precludes the self-definitional, fitted-input, or self-citation-load-bearing patterns required for a positive circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5509 in / 1028 out tokens · 38471 ms · 2026-05-13T22:13:10.360965+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · 10 internal anchors

  1. [1]

    Referit3d: Neu- ral listeners for fine-grained 3d object identification in real- world scenes

    Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mo- hamed Elhoseiny, and Leonidas Guibas. Referit3d: Neu- ral listeners for fine-grained 3d object identification in real- world scenes. InEuropean Conference on Computer Vision (ECCV), 2020. 2

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Eliza- beth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025. 3

  3. [3]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InConference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 6, 8

  4. [4]

    3d scene graph: A structure for unified semantics, 3d space, and camera

    Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R Za- mir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. InInternational Conference on Computer Vi- sion (ICCV), 2019. 5

  5. [5]

    Scenescript: Reconstructing scenes with an au- toregressive structured language model

    Armen Avetisyan, Christopher Xie, Henry Howard- Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scenescript: Reconstructing scenes with an au- toregressive structured language model. InEuropean Con- ference on Computer Vision (ECCV), 2024. 2

  6. [6]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Mo- toaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. InConference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

  7. [7]

    Qwen2.5-VL technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 2, 3, 4, 5

  8. [8]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understand- ing using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understand- ing using mobile rgb-d data. InProceedings of Advances in Neural Information Processing Systems Datasets and Benchmarks (NeurIPS Datasets ...

  9. [9]

    Contrastive lift: 3d ob- ject instance segmentation by slow-fast contrastive fusion

    Yash Bhalgat, Iro Laina, João F Henriques, Andrea Vedaldi, and Andrew Zisserman. Contrastive lift: 3d ob- ject instance segmentation by slow-fast contrastive fusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 4

  10. [10]

    Depth pro: Sharp monocular metric depth in less than a second

    Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InInternational Conference on Learning Representations (ICLR), 2025. 3

  11. [11]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InConference on Computer Vision and Pattern Recognition (CVPR), 2023. 2

  12. [12]

    Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

    Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025. 6, 8

  13. [13]

    Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts

    Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts. arXiv preprint arXiv:2511.04655, 2025. 6

  14. [14]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. InInternational Conference on 3D Vision (3DV), 2017. 6

  15. [15]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 2

  16. [16]

    Pgsr: Planar-based gaussian splat- ting for efficient and high-fidelity surface reconstruction

    Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splat- ting for efficient and high-fidelity surface reconstruction. IEEE Transactions on Visualization and Computer Graph- ics, 2024. 4

  17. [17]

    Scanrefer: 3d object localization in rgb-d scans using natu- ral language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natu- ral language. InEuropean Conference on Computer Vision (ECCV), 2020. 2

  18. [18]

    D3net: a speaker-listener architecture for semi-supervised dense captioning and visual grounding in rgb-d scans

    Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, and An- gel X Chang. D3net: a speaker-listener architecture for semi-supervised dense captioning and visual grounding in rgb-d scans. InEuropean Conference on Computer Vision (ECCV), 2022. 2

  19. [19]

    End-to-end 3d dense captioning with vote2cap-detr

    Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Gang Yu, and Tao Chen. End-to-end 3d dense captioning with vote2cap-detr. InConference on Computer Vision and Pat- tern Recognition (CVPR), 2023. 2

  20. [20]

    Monocular 3d object de- tection for autonomous driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3d object de- tection for autonomous driving. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 2

  21. [21]

    Synergai: Per- ception alignment for human-robot collaboration

    Yixin Chen, I Guoxi Zhang, Yaowei Zhang, Hongming Xu, Peiyuan Zhi, Qing Li, and Siyuan Huang. Synergai: Per- ception alignment for human-robot collaboration. InInter- national Conference on Robotics and Automation (ICRA),

  22. [22]

    Scan2cap: Context-aware dense captioning in rgb- d scans

    Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb- d scans. InConference on Computer Vision and Pattern Recognition (CVPR), 2021. 2

  23. [23]

    Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. InConference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  24. [24]

    Navila: Legged robot vision-language-action model for navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiao- long Wang. Navila: Legged robot vision-language-action model for navigation. InRobotics: Science and Systems (RSS), 2025. 3, 6, 8, 4

  25. [25]

    Depth-regularized optimization for 3d gaussian splatting in few-shot images

    Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. InConference on Computer Vision and Pattern Recognition (CVPR), 2024. 1

  26. [26]

    A volumetric method for building complex models from range images

    Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. InProceed- ings of the 23rd annual conference on Computer graphics and interactive techniques, 1996. 2

  27. [27]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 3

  28. [28]

    Depth-supervised nerf: Fewer views and faster training for free

    Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ra- manan. Depth-supervised nerf: Fewer views and faster training for free. InConference on Computer Vision and Pattern Recognition (CVPR), 2022. 1

  29. [29]

    Votenet: A deep learning label fusion method for multi-atlas segmentation

    Zhipeng Ding, Xu Han, and Marc Niethammer. Votenet: A deep learning label fusion method for multi-atlas segmentation. In Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2019. 2

  30. [30]

    ivs-net: Learning human view synthesis from internet videos

    Junting Dong, Qi Fang, Tianshuo Yang, Qing Shuai, Chengyu Qiao, and Sida Peng. ivs-net: Learning human view synthesis from internet videos. InInternational Con- ference on Computer Vision (ICCV), 2023. 3

  31. [31]

    Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinza- epfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. InInternational Conference on 3D Vision (3DV), 2025. 3, 1

  32. [32]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. InAdvances in Neural Information Process- ing Systems (NeurIPS), 2014. 1

  33. [33]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. InInternational Conference on Computer Vision (ICCV),

  34. [34]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025. 5, 6

  35. [35]

    Efficient graph-based image segmentation.International Journal of Computer Vision (IJCV), 2004

    Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation.International Journal of Computer Vision (IJCV), 2004. 5

  36. [36]

    Deep ordinal regression network for monocular depth estimation

    Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Bat- manghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. InConference on Computer Vision and Pattern Recognition (CVPR), 2018. 1

  37. [37]

    Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation

    Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Lanyun Zhu, Xiaowei Zhou, Andreas Geiger, and Yiyi Liao. Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation. InInternational Conference on 3D Vision (3DV), 2022. 2, 4

  38. [38]

    Frames of mind: The theory of multiple intelligences

    Howard Gardner. Frames of mind: The theory of multiple intelligences. Basic books, 2011. 5

  39. [39]

    Matcha gaussians: Atlas of charts for high- quality geometry and photorealism from sparse views

    Antoine Guédon, Tomoki Ichikawa, Kohei Yamashita, and Ko Nishino. Matcha gaussians: Atlas of charts for high- quality geometry and photorealism from sparse views. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 4

  40. [40]

    Roomtour3d: Geometry-aware video- instruction tuning for embodied navigation

    Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, and Ivan Laptev. Roomtour3d: Geometry-aware video- instruction tuning for embodied navigation. InConfer- ence on Computer Vision and Pattern Recognition (CVPR),

  41. [41]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press,

  42. [42]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir- shick. Mask r-cnn. InInternational Conference on Com- puter Vision (ICCV), 2017. 3

  43. [43]

    Cam- bridge university press Cambridge, 1986

    Annette Herskovits.Language and spatial cognition. Cam- bridge university press Cambridge, 1986. 5

  44. [44]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Ima- gen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022. 3

  45. [45]

    Vln bert: A recurrent vision- and-language bert for navigation

    Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez- Opazo, and Stephen Gould. Vln bert: A recurrent vision- and-language bert for navigation. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2021. 2

  46. [46]

    3d concept learn- ing and reasoning from multi-view images

    Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. 3d concept learn- ing and reasoning from multi-view images. InConfer- ence on Computer Vision and Pattern Recognition (CVPR),

  47. [47]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022. 6

  48. [48]

    2d gaussian splatting for geometrically ac- curate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH / Eurographics Symposium on Computer Animation (SCA), 2024. 4

  49. [49]

    An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 5

  50. [50]

    Sceneverse: Scaling 3d vision-language learning for grounded scene understanding

    Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision (ECCV). Springer, 2024. 2, 5

  51. [51]

    Pointgroup: Dual-set point group- ing for 3d instance segmentation

    Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point group- ing for 3d instance segmentation. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2020. 2

  52. [52]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

  53. [53]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. InACM SIGGRAPH / Eurograph- ics Symposium on Computer Animation (SCA), 2023. 2, 4

  54. [54]

    Lerf: Language embed- ded radiance fields

    Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embed- ded radiance fields. InInternational Conference on Com- puter Vision (ICCV), 2023. 2

  55. [55]

    Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal naviga- tion

    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal naviga- tion. InConference on Computer Vision and Pattern Recog- nition (CVPR), 2024. 2

  56. [56]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InInternational Conference on Computer Vision (ICCV), 2023. 4

  57. [57]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. 2

  58. [58]

    Unidet3d: Multi-dataset indoor 3d object detection

    Maksim Kolodiazhnyi, Anna Vorontsova, Matvey Skripkin, Danila Rukhovich, and Anton Konushin. Unidet3d: Multi-dataset indoor 3d object detection. In AAAI Conference on Artificial Intelligence (AAAI), 2025. 2

  59. [59]

    Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. InAnnual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. 2

  60. [60]

    Panoptic neu- ral fields: A semantic object-aware neural scene representa- tion

    Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J Guibas, Andrea Tagliasac- chi, Frank Dellaert, and Thomas Funkhouser. Panoptic neu- ral fields: A semantic object-aware neural scene representa- tion. InConference on Computer Vision and Pattern Recog- nition (CVPR), 2022. 2

  61. [61]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InEuropean Conference on Computer Vision (ECCV), 2024. 3

  62. [62]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. InInter- national Conference on Machine Learning (ICML), 2023. 2

  63. [63]

    Describe anything: De- tailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Dar- rell, Adam Yala, and Yin Cui. Describe anything: De- tailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 4

  64. [64]

    Learning vision-and- language navigation from youtube videos

    Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H Li, Mingkui Tan, and Chuang Gan. Learning vision-and- language navigation from youtube videos. InInternational Conference on Computer Vision (ICCV), 2023. 3, 6

  65. [65]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision (ECCV), 2014. 2

  66. [66]

    Weakly supervised 3d open- vocabulary segmentation

    Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open- vocabulary segmentation. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2023. 2

  67. [67]

    Novel view extrapolation with video diffusion priors.arXiv preprint arXiv:2411.14208,

    Kunhao Liu, Ling Shao, and Shijian Lu. Novel view extrapolation with video diffusion priors.arXiv preprint arXiv:2411.14208, 2024. 3

  68. [68]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2024. 2

  69. [69]

    Taco: Taming diffusion for in-the-wild video amodal completion

    Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, and Siyuan Huang. Taco: Taming diffusion for in-the-wild video amodal completion. InIn- ternational Conference on Computer Vision (ICCV), 2025. 3

  70. [70]

    Mo- vis: Enhancing multi-object novel view synthesis for in- door scenes

    Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, and Siyuan Huang. Mo- vis: Enhancing multi-object novel view synthesis for in- door scenes. InConference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  71. [71]

    Scalable 3d captioning with pretrained mod- els

    Tiange Luo, Chris Rockwell, Honglak Lee, and Justin Johnson. Scalable 3d captioning with pretrained mod- els. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 2

  72. [72]

    You see it, you got it: Learning 3d creation on pose-free videos at scale

    Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, and Xinlong Wang. You see it, you got it: Learning 3d creation on pose-free videos at scale. InConference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  73. [73]

    Sqa3d: Sit- uated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Sit- uated question answering in 3d scenes. InInternational Conference on Learning Representations (ICLR), 2023. 2

  74. [74]

    Multiscan: Scalable rgbd scanning for 3d environments with articulated objects

    Yongsen Mao, Yiming Zhang, Hanxiao Jiang, Angel Chang, and Manolis Savva. Multiscan: Scalable rgbd scanning for 3d environments with articulated objects. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. 2, 3

  75. [75]

    Spatiallm: Training large language models for structured indoor mod- eling

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor mod- eling. InAdvances in Neural Information Processing Sys- tems (NeurIPS), 2025. 2, 5

  76. [76]

    Towards scalable spatial intelligence via 2d-to-3d data lifting

    Xingyu Miao, Haoran Duan, Quanhao Qian, Jiuniu Wang, Yang Long, Ling Shao, Deli Zhao, Ran Xu, and Gongjie Zhang. Towards scalable spatial intelligence via 2d-to-3d data lifting. InInternational Conference on Computer Vi- sion (ICCV), 2025. 2

  77. [77]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020. 2, 4

  78. [78]

    An end- to-end transformer model for 3d object detection

    Ishan Misra, Rohit Girdhar, and Armand Joulin. An end- to-end transformer model for 3d object detection. InIn- ternational Conference on Computer Vision (ICCV), 2021. 2

  79. [79]

    Phyrecon: Physically plausible neural scene recon- struction

    Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Puhao Li, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene recon- struction. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 2, 4

  80. [80]

    Decompositional neural scene reconstruction with generative diffusion prior

    Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, and Siyuan Huang. Decompositional neural scene reconstruction with generative diffusion prior. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Showing first 80 references.