Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Pith reviewed 2026-05-13 22:13 UTC · model grok-4.3
The pith
Unlabeled internet videos can be automatically converted into training data for 3D scene understanding models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data for end-to-end 3D scene understanding models, which achieve strong zero-shot performance across low-level perception and high-level reasoning tasks and improve further after finetuning.
What carries the argument
The automated data generation engine that extracts 3D training signals and annotations from unlabeled videos by addressing identified bottlenecks in the lifting process.
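The excerpt names the engine's role but not its components. As a concrete reading, the stages the referee later identifies (structure-from-motion, monocular depth, pseudo-labeling) could be composed roughly as below; every name and interface here is a hypothetical sketch, not the paper's actual design.

```python
# Hypothetical sketch of a video-to-3D lifting pipeline. Only the stage names
# (SfM, monocular depth, pseudo-labeling) come from the review; the dataclass,
# function signatures, and the clip-filtering heuristic are assumptions.
from dataclasses import dataclass, field

@dataclass
class LiftedScene:
    """3D training signals recovered from one unlabeled video clip."""
    camera_poses: list = field(default_factory=list)   # per-frame extrinsics
    depth_maps: list = field(default_factory=list)     # per-frame metric depth
    pseudo_labels: list = field(default_factory=list)  # lifted 3D boxes/masks

def lift_video(frames, sfm, depth_model, labeler, min_track_ratio=0.5):
    """Run the three stages; discard clips where SfM registers too few
    frames, since broken reconstruction is one of the noted bottlenecks."""
    poses = sfm(frames)                        # structure-from-motion
    if len(poses) < min_track_ratio * len(frames):
        return None                            # drop low-quality clips
    depths = [depth_model(f) for f in frames]  # monocular depth per frame
    labels = labeler(frames, poses, depths)    # lift 2D detections to 3D
    return LiftedScene(poses, depths, labels)
```

The filtering step makes the "critical factors" claim concrete: which clips survive, and under what thresholds, is exactly the kind of bottleneck the paper says it analyzes.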
If this is right
- 3D scene models can train effectively using only data lifted from internet videos without human labels for initial performance.
- The same generated data supports both low-level tasks like object detection and high-level tasks like spatial VQA and navigation.
- Combining the generated data with small amounts of human annotations yields further accuracy gains through finetuning.
- Training data volume can scale directly with the amount of available unlabeled online video content.
Where Pith is reading between the lines
- The approach could extend to streaming internet video for ongoing model updates without new annotation campaigns.
- Similar lifting from unlabeled sources might apply to other data-scarce areas such as 3D human interaction or robotics simulation.
- If generation quality improves, full replacement of manual 3D annotation becomes feasible for many downstream applications.
Load-bearing premise
The automated process from unlabeled videos produces sufficiently clean and diverse 3D training signals without introducing harmful biases or noise.
What would settle it
The claim would be refuted if models trained only on the generated data showed no zero-shot improvement over random baselines, and no gains after finetuning, on standard 3D benchmarks for detection, segmentation, VQA, or navigation.
Original abstract
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-evel reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a data engine that automatically lifts 3D annotations from unlabeled internet videos to train models for 3D object detection, instance segmentation, spatial VQA, and VLN. It claims to identify and analyze bottlenecks in the generation process and reports that models trained on the generated data achieve strong zero-shot performance that further improves after finetuning on human-annotated data.
Significance. If the generated labels prove sufficiently clean and diverse, the work would be significant for scaling 3D scene understanding without relying solely on expensive manual annotations, by demonstrating a viable path from abundant web video to usable training signals across perception and reasoning tasks.
Major comments (2)
- Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains.
- Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.
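The fidelity metrics these comments ask for are standard and easy to state precisely. A minimal illustration of two of them (depth RMSE and pseudo-label precision) follows; the metric names come from the report, while the data layout and the 0.25 IoU threshold are assumptions for the sketch.

```python
# Illustrative implementations of two fidelity metrics the report requests.
# The IoU function is passed in because 3D IoU depends on the box
# parameterization, which the excerpt does not specify.
import math

def depth_rmse(pred, gt):
    """Root-mean-square error between predicted and reference depth values."""
    assert len(pred) == len(gt) and pred
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

def pseudo_label_precision(pred_boxes, gt_boxes, iou_fn, thresh=0.25):
    """Fraction of generated 3D boxes matching some reference box at
    IoU >= thresh; low precision would mean noisy pseudo-labels."""
    if not pred_boxes:
        return 0.0
    hits = sum(1 for p in pred_boxes
               if any(iou_fn(p, g) >= thresh for g in gt_boxes))
    return hits / len(pred_boxes)
```

Reporting numbers like these on a held-out subset with ground-truth 3D data would directly address the concern that reconstruction noise, rather than genuine signal, drives the downstream gains.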
Minor comments (1)
- Abstract: 'high-evel' is a typo and should read 'high-level'; 'Vision-Lanugage' should read 'Vision-Language'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract would benefit from explicit quantitative support and that additional direct assessments of generated label quality would strengthen the presentation. We address each major comment below and outline the revisions we will make.
Point-by-point responses
Referee: Abstract: the central claim of 'strong zero-shot performance' is presented without any quantitative results, error analysis, ablation studies, or metrics on generated-label fidelity (such as 3D IoU, depth RMSE, or pseudo-label precision). This absence is load-bearing because the viability argument rests on the automated data being clean enough to support the reported downstream gains.
Authors: We agree that the abstract currently summarizes the claims at a high level without embedding specific numbers. The full manuscript reports quantitative zero-shot and finetuned results across 3D detection, instance segmentation, spatial VQA, and VLN, along with bottleneck analysis. In the revision we will update the abstract to include key performance metrics (e.g., mAP on detection, accuracy on VQA) that support the zero-shot claim. We will also add a concise statement on label-fidelity analysis drawn from our bottleneck study. Revision: yes.
Referee: Experiments section (inferred from abstract evaluation claims): no quantitative assessment of the data engine's output quality against reference sets is supplied, leaving open the possibility that residual noise from SfM, monocular depth, or pseudo-labeling steps (common failure modes in web video) drives or inflates the zero-shot numbers rather than genuine signal.
Authors: We partially agree. The experiments evaluate the generated data via downstream task performance and explicit bottleneck analysis that identifies which generation steps most affect final accuracy; this provides indirect but task-relevant evidence of signal quality. However, we acknowledge that direct metrics against reference 3D annotations (3D IoU, depth RMSE, pseudo-label precision) are not reported. In the revision we will add a dedicated paragraph and table on a held-out subset with available ground-truth 3D data to quantify these fidelity metrics and address the noise concern. Revision: partial.
Circularity Check
No significant circularity: empirical results rest on external validation
Full rationale
The paper describes an empirical pipeline for lifting 3D annotations from unlabeled web videos and reports downstream task performance (zero-shot and fine-tuned) on detection, segmentation, VQA, and VLN. No equations, parameter fittings, uniqueness theorems, or derivations appear in the provided text. All central claims are supported by experimental comparisons against human-annotated baselines rather than by any self-referential reduction of outputs to inputs. The absence of mathematical structure precludes the self-definitional, fitted-input, or self-citation-load-bearing patterns required for a positive circularity finding.