pith. machine review for the scientific record.

arxiv: 2111.08897 · v3 · submitted 2021-11-17 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords RGB-D dataset · indoor scene understanding · 3D object detection · depth upsampling · mobile LiDAR · 3D bounding boxes · real-world dataset · ARKitScenes

The pith

ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARKitScenes, a dataset of RGB-D captures collected from Apple iPads and iPhones that have LiDAR sensors. It augments the raw mobile data with high-resolution depth maps from a stationary laser scanner and manual 3D oriented bounding box labels for a large set of furniture categories. The authors test the data on two tasks, 3D object detection and color-guided depth upsampling, and report that it improves existing methods while exposing challenges closer to everyday conditions. A sympathetic reader would care because the captures come from devices already owned by millions of people, moving 3D scene understanding from controlled lab settings toward practical mobile use.
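The second downstream task, color-guided depth upsampling, is easy to make concrete. The sketch below is a minimal joint bilateral upsampling in the spirit of Kopf et al. [38], one baseline family for this task; the function name, window radius, and sigma values are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, guide_hr, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Upsample a low-res depth map to the resolution of a grayscale guide
    image, weighting each depth sample by spatial distance and by guide
    similarity so that depth edges snap to color edges."""
    H, W = guide_hr.shape
    h, w = depth_lr.shape
    sy, sx = h / H, w / W  # high-res -> low-res coordinate scale
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            ci, cj = i * sy, j * sx  # target pixel in low-res coordinates
            i0, i1 = max(0, int(ci) - radius), min(h, int(ci) + radius + 1)
            j0, j1 = max(0, int(cj) - radius), min(w, int(cj) + radius + 1)
            wsum = dsum = 0.0
            for u in range(i0, i1):
                for v in range(j0, j1):
                    # spatial weight, measured on the low-res grid
                    ws = np.exp(-((u - ci) ** 2 + (v - cj) ** 2) / (2 * sigma_s ** 2))
                    # range weight: compare the guide at the target pixel with
                    # the guide pixel nearest to this low-res depth sample
                    gu = min(H - 1, round(u / sy))
                    gv = min(W - 1, round(v / sx))
                    wr = np.exp(-((guide_hr[i, j] - guide_hr[gu, gv]) ** 2) / (2 * sigma_r ** 2))
                    wsum += ws * wr
                    dsum += ws * wr * depth_lr[u, v]
            out[i, j] = dsum / wsum
    return out
```

On a constant depth map the filter reproduces the constant exactly; across a depth edge that coincides with a guide edge, the range term suppresses samples from the far side, which is what makes a sharp high-resolution ground truth like the laser scans usable as a training target.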

Core claim

ARKitScenes is the first RGB-D dataset captured with the widely available depth sensor on iPads and iPhones and, to the authors' knowledge, the largest indoor scene understanding dataset released. It supplies raw and processed mobile device data, high-resolution depth maps from a stationary laser scanner, and manually labeled 3D oriented bounding boxes for furniture. Evaluation on 3D object detection and color-guided depth upsampling shows the dataset pushes state-of-the-art performance and introduces new real-world challenges.

What carries the argument

The ARKitScenes dataset that pairs mobile RGB-D captures with laser-scanner depth maps and manual 3D bounding box annotations for indoor furniture.

If this is right

  • 3D object detection models achieve higher accuracy on large furniture taxonomies when trained with the labeled mobile data.
  • Color-guided depth upsampling produces higher-resolution outputs by using the laser scans as precise ground truth.
  • The dataset scale supports training larger machine-learning models for indoor scene understanding.
  • Methods developed on the data must handle noise and viewpoint variation typical of handheld mobile captures.
  • The combination of mobile and laser data creates a bridge between consumer hardware and high-precision references.
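To ground the bullets above: a gravity-aligned oriented box is commonly stored as a center, a size, and a yaw about the vertical axis, then expanded to 8 corners for IoU computation or visualization. The parameterization below is a generic convention, not necessarily the exact ARKitScenes schema.

```python
import numpy as np

def obb_corners(center, size, yaw):
    """8 corners of a gravity-aligned oriented box given its center (x, y, z),
    size (w, l, h), and yaw about the vertical (z) axis."""
    cx, cy, cz = center
    w, l, h = size
    # axis-aligned corner offsets before rotation
    dx = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * w / 2
    dy = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    dz = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * h / 2
    c, s = np.cos(yaw), np.sin(yaw)
    # rotate the offsets about the up axis, then translate to the center
    x = cx + c * dx - s * dy
    y = cy + s * dx + c * dy
    z = cz + dz
    return np.stack([x, y, z], axis=1)  # shape (8, 3)
```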

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • App developers could fine-tune models on this data to add room-layout awareness to consumer AR experiences without extra hardware.
  • The dataset could be used to study how well algorithms generalize from mobile captures to other depth sensors.
  • Future releases might add semantic segmentation labels or dynamic object tracks to extend the current static bounding-box focus.
  • Cross-validation across different device models within the captures could reveal hardware-specific biases in depth sensing.

Load-bearing premise

The mobile RGB-D captures, laser-scanned depth maps, and manual 3D bounding box labels are sufficiently accurate and representative of real-world indoor scenes to advance state-of-the-art methods.

What would settle it

A controlled test in which models trained on ARKitScenes show no improvement over models trained on prior datasets when evaluated on independent mobile RGB-D captures from varied indoor rooms would falsify the usefulness claim.

read the original abstract

Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can now impact people's everyday experiences. However, transforming these scene understanding methods to real-world experiences requires additional innovation and development. In this paper we introduce ARKitScenes. It is not only the first RGB-D dataset that is captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released. In addition to the raw and processed data from the mobile device, ARKitScenes includes high resolution depth maps captured using a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for a large taxonomy of furniture. We further analyze the usefulness of the data for two downstream tasks: 3D object detection and color-guided depth upsampling. We demonstrate that our dataset can help push the boundaries of existing state-of-the-art methods and it introduces new challenges that better represent real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARKitScenes as the first RGB-D dataset captured with Apple's widely available LiDAR sensor on mobile iPads/iPhones and, to the authors' knowledge, the largest indoor scene understanding dataset released. It supplies raw and processed mobile RGB-D captures, registered high-resolution depth maps from a stationary laser scanner, and manually annotated 3D oriented bounding boxes over a furniture taxonomy. The authors compare scale and characteristics to prior datasets (ScanNet, Matterport3D) and demonstrate utility on two downstream tasks: 3D object detection and color-guided depth upsampling, claiming the data pushes SOTA boundaries while introducing real-world challenges.

Significance. If the scale, registration quality, and annotation accuracy hold, the release supplies a high-value resource whose mobile capture characteristics better match everyday consumer hardware than prior lab-style datasets. This can accelerate development of robust 3D scene understanding methods for mobile applications, with the laser-scanned depths and 3D boxes providing strong supervision signals for detection and upsampling benchmarks.

major comments (2)
  1. [§4] §4 (Dataset Statistics): the central claim that ARKitScenes is the largest indoor dataset requires an explicit side-by-side table (number of scenes, frames, annotated objects, capture conditions) against ScanNet and Matterport3D; without these numbers the size/diversity assertion is unsupported.
  2. [§6] §6 (Downstream Tasks): the demonstrations for 3D object detection and depth upsampling must report concrete metrics (mAP, RMSE, etc.) and baselines; the abstract states only that the data 'pushes boundaries' without evidence, which is load-bearing for the utility claim.
minor comments (2)
  1. Figure captions should explicitly state what each panel shows (RGB, mobile depth, laser depth, projected boxes) and include scale bars or units.
  2. [§3] The taxonomy of furniture classes and the exact annotation protocol (number of annotators, quality control) should be listed in a dedicated subsection or table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments. We address each point below.

read point-by-point responses
  1. Referee: [§4] §4 (Dataset Statistics): the central claim that ARKitScenes is the largest indoor dataset requires an explicit side-by-side table (number of scenes, frames, annotated objects, capture conditions) against ScanNet and Matterport3D; without these numbers the size/diversity assertion is unsupported.

    Authors: We agree that an explicit comparison table will strengthen the claim. In the revised manuscript we will insert a side-by-side table in §4 that reports number of scenes, frames, annotated objects, and capture conditions for ARKitScenes, ScanNet, and Matterport3D. revision: yes

  2. Referee: [§6] §6 (Downstream Tasks): the demonstrations for 3D object detection and depth upsampling must report concrete metrics (mAP, RMSE, etc.) and baselines; the abstract states only that the data 'pushes boundaries' without evidence, which is load-bearing for the utility claim.

    Authors: We will revise the abstract to include the key quantitative results (mAP for detection and RMSE for upsampling) and will ensure §6 explicitly lists all metrics together with the baselines used. This will provide the concrete evidence requested. revision: yes
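For reference, the RMSE the rebuttal commits to reporting is typically computed over valid ground-truth pixels only, since laser scans leave holes. An illustrative version (not the authors' evaluation code):

```python
import numpy as np

def depth_rmse(pred, gt, valid=None):
    """RMSE between predicted and ground-truth depth, restricted to pixels
    where the ground truth is valid (depth 0 conventionally marks a hole)."""
    if valid is None:
        valid = gt > 0
    err = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(err ** 2)))
```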

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a dataset release paper whose central claims concern the scale, sensor type, and annotation quality of ARKitScenes itself. No mathematical derivations, fitted parameters, or predictions appear in the manuscript. Claims of being the first LiDAR-based RGB-D dataset and the largest indoor scene-understanding release are supported by explicit size statistics and direct comparisons to ScanNet, Matterport3D, and similar prior releases, none of which reduce to self-citation chains or self-definitional loops. The two downstream-task demonstrations (3D object detection and depth upsampling) are empirical evaluations on the released data rather than derivations that collapse to their own inputs. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a new data collection and annotation effort rather than a derivation; the main assumptions concern sensor accuracy and label quality, which are domain-standard for RGB-D datasets.

axioms (1)
  • domain assumption Mobile RGB-D sensors such as Apple's LiDAR produce depth data of sufficient quality for indoor scene understanding tasks
    Invoked when positioning the dataset as enabling real-world applications

pith-pipeline@v0.9.0 · 5597 in / 1250 out tokens · 43661 ms · 2026-05-15T10:41:47.251267+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  2. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  3. DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

    cs.RO 2026-04 unverdicted novelty 7.0

    DENALI is the first large-scale real-world dataset of space-time histograms from low-cost LiDARs for training models to perceive hidden objects via multi-bounce light cues.

  4. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  5. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  6. Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

    cs.CV 2026-03 unverdicted novelty 7.0

    VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.

  7. ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

    cs.CV 2026-03 unverdicted novelty 7.0

    ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

  8. $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    cs.CV 2025-07 conditional novelty 7.0

    π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...

  9. Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency t...

  10. HSG: Hyperbolic Scene Graph

    cs.CV 2026-04 unverdicted novelty 6.0

    Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.

  11. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  12. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  13. Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

    cs.CV 2026-04 unverdicted novelty 6.0

    BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.

  14. Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...

  15. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  16. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  17. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  18. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  19. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  20. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  21. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 21 Pith papers · 2 internal anchors

  1. [1]

    3d-sis: 3d semantic instance segmentation of rgb-d scans

    Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , pages 4421–4430, 2019

  2. [2]

    Gspn: Generative shape proposal network for 3d instance segmentation in point cloud

    Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas J Guibas. Gspn: Generative shape proposal network for 3d instance segmentation in point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 3947–3956, 2019

  3. [3]

    Sgpn: Similarity group proposal network for 3d point cloud instance segmentation

Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 2569–2578, 2018

  4. [4]

    Deep hough voting for 3d object detection in point clouds

    Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  5. [5]

    Imvotenet: Boosting 3d object detection in point clouds with image votes

    Charles R. Qi, Xinlei Chen, and Leonidas J. Guibas Or Litany. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  6. [6]

    Svga- net: Sparse voxel-graph attention network for 3d object detection from point clouds

    Qingdong He, Zhengning Wang, Hao Zeng, Yi Zeng, Shuaicheng Liu, and Bing Zeng. Svga- net: Sparse voxel-graph attention network for 3d object detection from point clouds. arXiv preprint arXiv:2006.04043, 2020

  7. [7]

    Group-free 3d object detection via transformers

    Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. arXiv preprint arXiv:2104.00678, 2021

  8. [8]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  9. [9]

    Sun3d: A database of big spaces reconstructed using sfm and object labels

    Jianxiong Xiao, Andrew Owens, and Antonio Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In Proc. International Conference on Computer Vision (ICCV), pages 1625–1632, 2013

  10. [10]

    A category-level 3d object dataset: Putting the kinect to work

Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision, pages 141–165. Springer, 2013

  11. [11]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 1534–1543, 2016

  12. [12]

    Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding

    Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , pages 909–918, 2019

  13. [13]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 2446–2454, 2020

  14. [14]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  15. [15]

    Sun rgb-d: A rgb-d scene understanding benchmark suite

    S Song, S Lichtenberg, and J Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , 2015

  16. [16]

    Indoor scene segmentation using a structured light sensor

    Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pages 601–608. IEEE, 2011

  17. [17]

    https://www.apple.com/newsroom/2020/03/apple-unveils-new-ipad-pro-with-lidar-scanner-and-trackpad-support-in-ipados/

  18. [18]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. Ieee, 2009

  19. [19]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. 2013

  20. [20]

    Lyft level 5 perception dataset 2020

    R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft level 5 perception dataset 2020. 2019

  21. [21]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  22. [22]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  23. [23]

    Scenenn: A scene meshes dataset with annotations

    Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai- Kit Yeung. Scenenn: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016

  24. [24]

    Pigraphs: Learning interaction snapshots from observations

    Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. Pigraphs: Learning interaction snapshots from observations. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016

  25. [25]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conference on Computer Vision (ECCV) , pages 611–625. Springer, 2012

  26. [26]

    High-resolution stereo datasets with subpixel-accurate ground truth

Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pages 31–42. Springer, 2014

  27. [27]

    Structure aware single-stage 3d object detection from point cloud

    Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  28. [28]

    Hvnet: Hybrid voxel network for lidar based 3d object detection

    Maosheng Ye, Shuangjie Xu, and Tongyi Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  29. [29]

    Point-gnn: Graph neural network for 3d object detection in a point cloud

    Weijing Shi and Raj Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  30. [30]

    Mlcvnet: Multi-level context votenet for 3d object detection

    Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. Mlcvnet: Multi-level context votenet for 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  31. [31]

    A hierarchical graph network for 3d object detection on point clouds

    Jintai Chen, Biwen Lei, Qingyu Song, Haochao Ying, Danny Z. Chen, and Jian Wu. A hierarchical graph network for 3d object detection on point clouds. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  32. [32]

    Frodo: From detections to 3d objects

    Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, and Richard Newcombe. Frodo: From detections to 3d objects. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020

  33. [33]

    Generative sparse detection networks for 3d single-shot object detection

    JunYoung Gwak, Christopher Choy, and Silvio Savarese. Generative sparse detection networks for 3d single-shot object detection. arXiv preprint arXiv:2006.12356, 2020

  34. [34]

    Frustum pointnets for 3d object detection from rgb-d data

    Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 918–927, 2018

  35. [35]

    Pv-rcnn: Point-voxel feature set abstraction for 3d object detection

Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 10529–10538, 2020

  36. [36]

    Objectron: A large scale dataset of object-centric videos in the wild with pose annotations

    Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. arXiv preprint arXiv:2012.09988, 2020

  37. [37]

    Depth map super-resolution by deep multi-scale guidance

    Tak-Wai Hui, Chen Change Loy, , and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. European Conference on Computer Vision (ECCV) , pages 353–369, 2016

  38. [38]

    Joint bilateral upsampling

    Johannes Kopf, Michael F. Cohen, Dani Lischinski, and Matt Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007), 26(3), 2007

  39. [39]

    Image guided depth upsampling using anisotropic total generalized variation

David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Rüther, and Horst Bischof. Image guided depth upsampling using anisotropic total generalized variation. In Proc. International Conference on Computer Vision (ICCV), pages 993–1000, 2013

  40. [40]

    A taxonomy and evaluation of dense two-frame stereo correspondence algorithms

    Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision (IJCV) , 47(1):7–42, 2002

  41. [41]

    High-accuracy stereo depth maps using structured light

    Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , volume 1, pages I–I. IEEE, 2003

  42. [42]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981

  43. [43]

    H3dnet: 3d object detection using hybrid geometric primitives

    Zaiwei Zhang, Bo Sun, Haitao Yang, and Qixing Huang. H3dnet: 3d object detection using hybrid geometric primitives. In Proc. European Conference on Computer Vision (ECCV) , 2020

  44. [44]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space

    Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017

  45. [45]

    Multi-scale progressive fusion learning for depth map super-resolution

    Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie CL Wang. Multi-scale progressive fusion learning for depth map super-resolution. arXiv preprint arXiv:2011.11865, 2020

  46. [46]

    Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2018