pith. machine review for the scientific record.

arxiv: 2111.08897 · v3 · submitted 2021-11-17 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords RGB-D dataset · indoor scene understanding · 3D object detection · depth upsampling · mobile LiDAR · 3D bounding boxes · real-world dataset · ARKitScenes

The pith

ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARKitScenes, a dataset of RGB-D captures collected from Apple iPads and iPhones that have LiDAR sensors. It augments the raw mobile data with high-resolution depth maps from a stationary laser scanner and manual 3D oriented bounding box labels for a large set of furniture categories. The authors test the data on two tasks, 3D object detection and color-guided depth upsampling, and report that it improves existing methods while exposing challenges closer to everyday conditions. A sympathetic reader would care because the captures come from devices already owned by millions of people, moving 3D scene understanding from controlled lab settings toward practical mobile use.
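The second downstream task, color-guided depth upsampling, is easy to make concrete. The sketch below is a minimal joint bilateral upsampling in the spirit of Kopf et al. [38], one baseline family for this task; the function name, window radius, and sigma values are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, guide_hr, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Upsample a low-res depth map to the resolution of a grayscale guide
    image, weighting each depth sample by spatial distance and by guide
    similarity so that depth edges snap to color edges."""
    H, W = guide_hr.shape
    h, w = depth_lr.shape
    sy, sx = h / H, w / W  # high-res -> low-res coordinate scale
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            ci, cj = i * sy, j * sx  # target pixel in low-res coordinates
            i0, i1 = max(0, int(ci) - radius), min(h, int(ci) + radius + 1)
            j0, j1 = max(0, int(cj) - radius), min(w, int(cj) + radius + 1)
            wsum = dsum = 0.0
            for u in range(i0, i1):
                for v in range(j0, j1):
                    # spatial weight, measured on the low-res grid
                    ws = np.exp(-((u - ci) ** 2 + (v - cj) ** 2) / (2 * sigma_s ** 2))
                    # range weight: compare the guide at the target pixel with
                    # the guide pixel nearest to this low-res depth sample
                    gu = min(H - 1, round(u / sy))
                    gv = min(W - 1, round(v / sx))
                    wr = np.exp(-((guide_hr[i, j] - guide_hr[gu, gv]) ** 2) / (2 * sigma_r ** 2))
                    wsum += ws * wr
                    dsum += ws * wr * depth_lr[u, v]
            out[i, j] = dsum / wsum
    return out
```

On a constant depth map the filter reproduces the constant exactly; across a depth edge that coincides with a guide edge, the range term suppresses samples from the far side, which is what makes a sharp high-resolution ground truth like the laser scans usable as a training target.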

Core claim

ARKitScenes is the first RGB-D dataset captured with the widely available depth sensor on iPads and iPhones and, to the authors' knowledge, the largest indoor scene understanding dataset released. It supplies raw and processed mobile device data, high-resolution depth maps from a stationary laser scanner, and manually labeled 3D oriented bounding boxes for furniture. Evaluation on 3D object detection and color-guided depth upsampling shows the dataset pushes state-of-the-art performance and introduces new real-world challenges.

What carries the argument

The ARKitScenes dataset that pairs mobile RGB-D captures with laser-scanner depth maps and manual 3D bounding box annotations for indoor furniture.

If this is right

  • 3D object detection models achieve higher accuracy on large furniture taxonomies when trained with the labeled mobile data.
  • Color-guided depth upsampling produces higher-resolution outputs by using the laser scans as precise ground truth.
  • The dataset scale supports training larger machine-learning models for indoor scene understanding.
  • Methods developed on the data must handle noise and viewpoint variation typical of handheld mobile captures.
  • The combination of mobile and laser data creates a bridge between consumer hardware and high-precision references.
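To ground the bullets above: a gravity-aligned oriented box is commonly stored as a center, a size, and a yaw about the vertical axis, then expanded to 8 corners for IoU computation or visualization. The parameterization below is a generic convention, not necessarily the exact ARKitScenes schema.

```python
import numpy as np

def obb_corners(center, size, yaw):
    """8 corners of a gravity-aligned oriented box given its center (x, y, z),
    size (w, l, h), and yaw about the vertical (z) axis."""
    cx, cy, cz = center
    w, l, h = size
    # axis-aligned corner offsets before rotation
    dx = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * w / 2
    dy = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    dz = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * h / 2
    c, s = np.cos(yaw), np.sin(yaw)
    # rotate the offsets about the up axis, then translate to the center
    x = cx + c * dx - s * dy
    y = cy + s * dx + c * dy
    z = cz + dz
    return np.stack([x, y, z], axis=1)  # shape (8, 3)
```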

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • App developers could fine-tune models on this data to add room-layout awareness to consumer AR experiences without extra hardware.
  • The dataset could be used to study how well algorithms generalize from mobile captures to other depth sensors.
  • Future releases might add semantic segmentation labels or dynamic object tracks to extend the current static bounding-box focus.
  • Cross-validation across different device models within the captures could reveal hardware-specific biases in depth sensing.

Load-bearing premise

The mobile RGB-D captures, laser-scanned depth maps, and manual 3D bounding box labels are sufficiently accurate and representative of real-world indoor scenes to advance state-of-the-art methods.

What would settle it

A controlled test in which models trained on ARKitScenes show no improvement over models trained on prior datasets when evaluated on independent mobile RGB-D captures from varied indoor rooms would falsify the usefulness claim.

read the original abstract

Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can now impact people's everyday experiences. However, transforming these scene understanding methods to real-world experiences requires additional innovation and development. In this paper we introduce ARKitScenes. It is not only the first RGB-D dataset that is captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released. In addition to the raw and processed data from the mobile device, ARKitScenes includes high resolution depth maps captured using a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for a large taxonomy of furniture. We further analyze the usefulness of the data for two downstream tasks: 3D object detection and color-guided depth upsampling. We demonstrate that our dataset can help push the boundaries of existing state-of-the-art methods and it introduces new challenges that better represent real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARKitScenes as the first RGB-D dataset captured with Apple's widely available LiDAR sensor on mobile iPads/iPhones and, to the authors' knowledge, the largest indoor scene understanding dataset released. It supplies raw and processed mobile RGB-D captures, registered high-resolution depth maps from a stationary laser scanner, and manually annotated 3D oriented bounding boxes over a furniture taxonomy. The authors compare scale and characteristics to prior datasets (ScanNet, Matterport3D) and demonstrate utility on two downstream tasks: 3D object detection and color-guided depth upsampling, claiming the data pushes SOTA boundaries while introducing real-world challenges.

Significance. If the scale, registration quality, and annotation accuracy hold, the release supplies a high-value resource whose mobile capture characteristics better match everyday consumer hardware than prior lab-style datasets. This can accelerate development of robust 3D scene understanding methods for mobile applications, with the laser-scanned depths and 3D boxes providing strong supervision signals for detection and upsampling benchmarks.

major comments (2)
  1. [§4] §4 (Dataset Statistics): the central claim that ARKitScenes is the largest indoor dataset requires an explicit side-by-side table (number of scenes, frames, annotated objects, capture conditions) against ScanNet and Matterport3D; without these numbers the size/diversity assertion is unsupported.
  2. [§6] §6 (Downstream Tasks): the demonstrations for 3D object detection and depth upsampling must report concrete metrics (mAP, RMSE, etc.) and baselines; the abstract states only that the data 'pushes boundaries' without evidence, which is load-bearing for the utility claim.
minor comments (2)
  1. Figure captions should explicitly state what each panel shows (RGB, mobile depth, laser depth, projected boxes) and include scale bars or units.
  2. [§3] The taxonomy of furniture classes and the exact annotation protocol (number of annotators, quality control) should be listed in a dedicated subsection or table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments. We address each point below.

read point-by-point responses
  1. Referee: [§4] §4 (Dataset Statistics): the central claim that ARKitScenes is the largest indoor dataset requires an explicit side-by-side table (number of scenes, frames, annotated objects, capture conditions) against ScanNet and Matterport3D; without these numbers the size/diversity assertion is unsupported.

    Authors: We agree that an explicit comparison table will strengthen the claim. In the revised manuscript we will insert a side-by-side table in §4 that reports number of scenes, frames, annotated objects, and capture conditions for ARKitScenes, ScanNet, and Matterport3D. revision: yes

  2. Referee: [§6] §6 (Downstream Tasks): the demonstrations for 3D object detection and depth upsampling must report concrete metrics (mAP, RMSE, etc.) and baselines; the abstract states only that the data 'pushes boundaries' without evidence, which is load-bearing for the utility claim.

    Authors: We will revise the abstract to include the key quantitative results (mAP for detection and RMSE for upsampling) and will ensure §6 explicitly lists all metrics together with the baselines used. This will provide the concrete evidence requested. revision: yes
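For reference, the RMSE the rebuttal commits to reporting is typically computed over valid ground-truth pixels only, since laser scans leave holes. An illustrative version (not the authors' evaluation code):

```python
import numpy as np

def depth_rmse(pred, gt, valid=None):
    """RMSE between predicted and ground-truth depth, restricted to pixels
    where the ground truth is valid (depth 0 conventionally marks a hole)."""
    if valid is None:
        valid = gt > 0
    err = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(err ** 2)))
```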

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a dataset release paper whose central claims concern the scale, sensor type, and annotation quality of ARKitScenes itself. No mathematical derivations, fitted parameters, or predictions appear in the manuscript. Claims of being the first LiDAR-based RGB-D dataset and the largest indoor scene-understanding release are supported by explicit size statistics and direct comparisons to ScanNet, Matterport3D, and similar prior releases, none of which reduce to self-citation chains or self-definitional loops. The two downstream-task demonstrations (3D object detection and depth upsampling) are empirical evaluations on the released data rather than derivations that collapse to their own inputs. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a new data collection and annotation effort rather than a derivation; the main assumptions concern sensor accuracy and label quality, which are domain-standard for RGB-D datasets.

axioms (1)
  • domain assumption Mobile RGB-D sensors such as Apple's LiDAR produce depth data of sufficient quality for indoor scene understanding tasks
    Invoked when positioning the dataset as enabling real-world applications

pith-pipeline@v0.9.0 · 5597 in / 1250 out tokens · 43661 ms · 2026-05-15T10:41:47.251267+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  2. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  3. DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

    cs.RO 2026-04 unverdicted novelty 7.0

    DENALI is the first large-scale real-world dataset of space-time histograms from low-cost LiDARs for training models to perceive hidden objects via multi-bounce light cues.

  4. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  5. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  6. Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

    cs.CV 2026-03 unverdicted novelty 7.0

    VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.

  7. ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

    cs.CV 2026-03 unverdicted novelty 7.0

    ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

  8. $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    cs.CV 2025-07 conditional novelty 7.0

    π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...

  9. Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency t...

  10. HSG: Hyperbolic Scene Graph

    cs.CV 2026-04 unverdicted novelty 6.0

    Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.

  11. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  12. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  13. Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

    cs.CV 2026-04 unverdicted novelty 6.0

    BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.

  14. Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...

  15. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  16. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  17. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  18. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  19. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  20. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  21. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 21 Pith papers · 2 internal anchors

  1. [1]

    3d-sis: 3d semantic instance segmentation of rgb-d scans

    Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , pages 4421–4430, 2019

  2. [2]

    Gspn: Generative shape proposal network for 3d instance segmentation in point cloud

    Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas J Guibas. Gspn: Generative shape proposal network for 3d instance segmentation in point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 3947–3956, 2019

  3. [3]

    Sgpn: Similarity group proposal network for 3d point cloud instance segmentation

Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 2569–2578, 2018

  4. [4]

    Deep hough voting for 3d object detection in point clouds

    Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  5. [5]

    Imvotenet: Boosting 3d object detection in point clouds with image votes

    Charles R. Qi, Xinlei Chen, and Leonidas J. Guibas Or Litany. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  6. [6]

    Svga- net: Sparse voxel-graph attention network for 3d object detection from point clouds

    Qingdong He, Zhengning Wang, Hao Zeng, Yi Zeng, Shuaicheng Liu, and Bing Zeng. Svga- net: Sparse voxel-graph attention network for 3d object detection from point clouds. arXiv preprint arXiv:2006.04043, 2020

  7. [7]

    Group-free 3d object detection via transformers

    Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. arXiv preprint arXiv:2104.00678, 2021

  8. [8]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  9. [9]

    Sun3d: A database of big spaces reconstructed using sfm and object labels

    Jianxiong Xiao, Andrew Owens, and Antonio Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In Proc. International Conference on Computer Vision (ICCV), pages 1625–1632, 2013

  10. [10]

    A category-level 3d object dataset: Putting the kinect to work

Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision, pages 141–165. Springer, 2013

  11. [11]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 1534–1543, 2016

  12. [12]

    Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding

    Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , pages 909–918, 2019

  13. [13]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 2446–2454, 2020

  14. [14]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  15. [15]

    Sun rgb-d: A rgb-d scene understanding benchmark suite

    S Song, S Lichtenberg, and J Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , 2015

  16. [16]

    Indoor scene segmentation using a structured light sensor

    Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pages 601–608. IEEE, 2011

  17. [17]

    https://www.apple.com/newsroom/2020/03/apple-unveils-new-ipad-pro-with-lidar-scanner-and-trackpad-support-in-ipados/

  18. [18]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. Ieee, 2009

  19. [19]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. 2013

  20. [20]

    Lyft level 5 perception dataset 2020

    R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft level 5 perception dataset 2020. 2019

  21. [21]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  22. [22]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  23. [23]

    Scenenn: A scene meshes dataset with annotations

    Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai- Kit Yeung. Scenenn: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016

  24. [24]

    Pigraphs: Learning interaction snapshots from observations

    Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. Pigraphs: Learning interaction snapshots from observations. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016

  25. [25]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conference on Computer Vision (ECCV) , pages 611–625. Springer, 2012

  26. [26]

    High-resolution stereo datasets with subpixel-accurate ground truth

Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pages 31–42. Springer, 2014

  27. [27]

    Structure aware single-stage 3d object detection from point cloud

    Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  28. [28]

    Hvnet: Hybrid voxel network for lidar based 3d object detection

    Maosheng Ye, Shuangjie Xu, and Tongyi Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  29. [29]

    Point-gnn: Graph neural network for 3d object detection in a point cloud

    Weijing Shi and Raj Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  30. [30]

    Mlcvnet: Multi-level context votenet for 3d object detection

    Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. Mlcvnet: Multi-level context votenet for 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  31. [31]

    A hierarchical graph network for 3d object detection on point clouds

    Jintai Chen, Biwen Lei, Qingyu Song, Haochao Ying, Danny Z. Chen, and Jian Wu. A hierarchical graph network for 3d object detection on point clouds. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  32. [32]

    Frodo: From detections to 3d objects

    Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, and Richard Newcombe. Frodo: From detections to 3d objects. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020

  33. [33]

    Generative sparse detection networks for 3d single-shot object detection

    JunYoung Gwak, Christopher Choy, and Silvio Savarese. Generative sparse detection networks for 3d single-shot object detection. arXiv preprint arXiv:2006.12356, 2020

  34. [34]

    Frustum pointnets for 3d object detection from rgb-d data

    Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 918–927, 2018

  35. [35]

    Pv-rcnn: Point-voxel feature set abstraction for 3d object detection

Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 10529–10538, 2020

  36. [36]

    Objectron: A large scale dataset of object-centric videos in the wild with pose annotations

    Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. arXiv preprint arXiv:2012.09988, 2020

  37. [37]

    Depth map super-resolution by deep multi-scale guidance

    Tak-Wai Hui, Chen Change Loy, , and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. European Conference on Computer Vision (ECCV) , pages 353–369, 2016

  38. [38]

    Joint bilateral upsampling

    Johannes Kopf, Michael F. Cohen, Dani Lischinski, and Matt Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007), 26(3), 2007

  39. [39]

    Image guided depth upsampling using anisotropic total generalized variation

David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Rüther, and Horst Bischof. Image guided depth upsampling using anisotropic total generalized variation. In Proc. International Conference on Computer Vision (ICCV), pages 993–1000, 2013

  40. [40]

    A taxonomy and evaluation of dense two-frame stereo correspondence algorithms

    Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision (IJCV) , 47(1):7–42, 2002

  41. [41]

    High-accuracy stereo depth maps using structured light

    Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) , volume 1, pages I–I. IEEE, 2003

  42. [42]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981

  43. [43]

    H3dnet: 3d object detection using hybrid geometric primitives

    Zaiwei Zhang, Bo Sun, Haitao Yang, and Qixing Huang. H3dnet: 3d object detection using hybrid geometric primitives. In Proc. European Conference on Computer Vision (ECCV) , 2020

  44. [44]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space

    Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017

  45. [45]

    Multi-scale progressive fusion learning for depth map super-resolution

    Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie CL Wang. Multi-scale progressive fusion learning for depth map super-resolution. arXiv preprint arXiv:2011.11865, 2020

  46. [46]

    Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2018