arxiv: 2109.08238 · v1 · submitted 2021-09-16 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Santhosh K. Ramakrishnan , Aaron Gokaslan , Erik Wijmans , Oleksandr Maksymets , Alex Clegg , John Turner , Eric Undersander , Wojciech Galuba

show 5 more authors

Andrew Westbury Angel X. Chang Manolis Savva Yili Zhao Dhruv Batra

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords HM3D3D datasetEmbodied AIPointGoal navigation3D reconstructionindoor environmentsHabitat simulatordataset scale

0 comments

The pith

HM3D dataset of 1000 real indoor 3D scenes produces PointGoal navigation agents that achieve top performance on HM3D, Gibson, and MP3D evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Habitat-Matterport 3D dataset containing 1000 building-scale textured 3D mesh reconstructions of real-world indoor spaces such as residences and stores. HM3D provides 112.5k square meters of navigable area, 1.4 to 3.7 times larger than prior building-scale sets, along with 20 to 85 percent higher visual fidelity in rendered images and 34 to 91 percent fewer reconstruction artifacts. Agents trained for PointGoal navigation on HM3D reach the highest success rates whether tested on HM3D itself or transferred to Gibson and MP3D, establishing the dataset as pareto optimal. No other training set supports the same cross-benchmark dominance, and HM3D agents reach 100 percent success on the Gibson test split.

Core claim

HM3D is pareto optimal in the sense that agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100 percent performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

What carries the argument

The HM3D collection of 1000 textured 3D mesh reconstructions of diverse real indoor spaces that supplies greater scale, completeness, and visual fidelity for embodied agent training.

If this is right

Embodied AI training pipelines can shift to HM3D as the primary source of environments because it yields superior agents on every tested benchmark.
Smaller datasets such as Gibson may reach saturation and become unnecessary for evaluation once agents achieve 100 percent success.
Increased scene diversity and fidelity in training data directly improves generalization of navigation policies across different indoor layouts.
Research on more complex embodied tasks can now leverage the larger navigable area and higher visual quality without immediate performance plateaus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future dataset construction for embodied AI should prioritize physical scale and surface completeness over other design choices to achieve cross-benchmark dominance.
The high visual fidelity may shorten the sim-to-real transfer gap when policies trained in HM3D are deployed on physical robots.
Benchmark suites could evolve to include cross-training evaluations as a standard test of dataset quality.
Larger environments open the possibility of studying long-horizon tasks that require agents to traverse multiple floors or visit distant rooms.

Load-bearing premise

The performance advantage of HM3D-trained agents arises primarily from the dataset's larger scale, reconstruction completeness, and visual fidelity rather than differences in training procedures or evaluation protocols.

What would settle it

Train identical PointGoal navigation agents on HM3D and on Gibson using the exact same procedure, then measure whether the HM3D-trained agents fail to exceed the Gibson-trained agents when both are evaluated on Gibson and MP3D test sets.

read the original abstract

We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is `pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HM3D is a larger set of real indoor 3D scenes with fewer reconstruction artifacts that trains PointNav agents showing strong transfer to older benchmarks.

read the letter

HM3D gives the field 1000 new building-scale 3D meshes from real homes, stores, and similar spaces. The dataset has 112.5k m² of navigable area, 1.4-3.7 times more than MP3D or Gibson, and the authors report 20-85% higher visual fidelity in rendered images plus 34-91% fewer mesh artifacts compared with prior photorealistic sets. They also run PointGoal navigation experiments and find that agents trained on HM3D reach the highest success rates whether tested on HM3D, Gibson, or MP3D, including 100% on the Gibson test set. That cross-dataset result is the part worth paying attention to, because it suggests the extra scale and completeness help generalization rather than just overfitting to one environment. The work is a straightforward dataset contribution with direct quantitative comparisons on reconstruction quality and agent performance. The main soft spot is the training-protocol detail behind the pareto claim. HM3D's larger area means more possible episodes, so if total training steps or episode counts were not strictly matched to the smaller datasets, the transfer gains could partly reflect more data volume instead of fidelity alone. The abstract does not spell out those controls, so the methods section needs checking. This paper is for people training embodied agents or building simulators. The data itself is a concrete, usable release that most groups in the area will want to try, and the claims are testable with the numbers given. It deserves peer review rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript presents the Habitat-Matterport 3D (HM3D) dataset of 1,000 building-scale 3D reconstructions from diverse real-world indoor locations. It claims HM3D surpasses prior datasets (MP3D, Gibson, Replica, ScanNet) in physical scale (112.5k m² navigable space, 1.4-3.7× larger), visual fidelity (20-85% higher w.r.t. real-camera images), and reconstruction completeness (34-91% fewer artifacts). The central empirical result is that PointGoal navigation agents trained on HM3D achieve the highest performance regardless of evaluation on HM3D, Gibson, or MP3D test sets, including 100% success on Gibson-test, making HM3D 'pareto optimal' with no analogous claim possible for other datasets.

Significance. If the performance gains hold under matched training conditions, HM3D supplies a substantially larger and higher-fidelity resource that could become the default training and evaluation environment for embodied AI, enabling more robust policies and potentially retiring smaller benchmarks such as Gibson. The direct cross-dataset comparisons and quantitative fidelity metrics constitute a concrete contribution that strengthens the empirical foundation of the field.

major comments (1)

[PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.

minor comments (1)

[Abstract] Abstract: the statement '100% performance on Gibson-test dataset' should specify the exact metric (success rate, SPL, etc.) and any evaluation conditions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below.

read point-by-point responses

Referee: [PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.

Authors: We thank the referee for this observation. The training protocols were identical across HM3D, Gibson, and MP3D: the same hyperparameters were used for all runs, the same total number of training steps was performed, and episodes were sampled uniformly at random from the training scenes of each dataset. To ensure a fair comparison given the differing navigable areas, we matched the number of training episodes across datasets by subsampling from the larger ones (HM3D and Gibson) to equal the episode count available from the smallest dataset. This controlled for data volume, so that performance differences can be attributed to scale, fidelity, and completeness. The manuscript describes the shared experimental setup in Section 4, but we agree it would benefit from greater explicitness. We will revise the paper to add a dedicated paragraph and summary table confirming the matched episode counts, steps, sampling strategy, and hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset paper with direct experimental comparisons

full rationale

The paper introduces the HM3D dataset and supports its 'pareto optimal' claim via reported PointNav training and cross-evaluation results on HM3D, Gibson, and MP3D. No equations, parameter fits, or derivations appear in the provided text. The performance claim is an empirical observation from agent training runs, not a reduction to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Training protocol details are not shown to collapse into the dataset properties by construction. This is a standard dataset contribution whose central assertions rest on external experimental outcomes rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset paper relying on established 3D scanning and reconstruction methods without introducing new free parameters, axioms, or entities.

pith-pipeline@v0.9.0 · 5672 in / 1080 out tokens · 51606 ms · 2026-05-14T18:18:41.752925+00:00 · methodology

discussion (0)

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
cs.CV 2026-05 unverdicted novelty 7.0

SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
cs.CV 2026-04 unverdicted novelty 7.0

InHabit generates 78K photorealistic 3D human-scene interaction samples across 800 scenes by rendering scenes, using foundation models to propose actions and insert humans, then optimizing to SMPL-X bodies, improving ...
Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search
cs.RO 2026-04 unverdicted novelty 7.0

SAGR builds a semantic area graph from occupancy maps so LLMs can assign rooms to robots for language-guided search, staying competitive with standard exploration while improving semantic target finding by up to 18.8%...
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
UniDAC: Universal Metric Depth Estimation for Any Camera
cs.CV 2026-03 unverdicted novelty 7.0

UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
cs.CV 2026-03 unverdicted novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
cs.AI 2026-05 conditional novelty 6.0

LongAct benchmark reveals top VLMs reach only 59% goal completion and 16% full success on long-horizon household tasks, while HoloMind agent improves results via DAG planner, multimodal spatial memory, episodic memory...
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
cs.RO 2026-05 unverdicted novelty 6.0

SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
cs.RO 2026-05 unverdicted novelty 6.0

PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 6.0

SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
cs.RO 2026-04 unverdicted novelty 6.0

OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
cs.RO 2026-04 unverdicted novelty 6.0

Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
cs.RO 2026-04 unverdicted novelty 6.0

FSUNav's dual brain-inspired modules achieve state-of-the-art zero-shot goal navigation across heterogeneous robots with improved speed, safety, and generalization.
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
cs.CV 2026-04 unverdicted novelty 6.0

UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...
ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation
cs.RO 2026-03 conditional novelty 6.0

ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.
Memory Over Maps: 3D Object Localization Without Reconstruction
cs.RO 2026-03 unverdicted novelty 6.0

A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...
Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation
cs.LG 2026-05 unverdicted novelty 5.0

A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
cs.CV 2026-05 unverdicted novelty 5.0

TrajRAG uses a topological-polar trajectory representation and hierarchical retrieval to accumulate and reuse geometric-semantic navigation experiences, improving zero-shot ObjectNav on MP3D and HM3D benchmarks.
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
cs.CV 2026-04 unverdicted novelty 5.0

UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
cs.CV 2026-04 unverdicted novelty 5.0

ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
Think before Go: Hierarchical Reasoning for Image-goal Navigation
cs.RO 2026-04 unverdicted novelty 5.0

HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
cs.CV 2026-04 unverdicted novelty 5.0

HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
cs.RO 2026-03 unverdicted novelty 5.0

IGV-RRT improves object goal navigation in dynamic indoor environments by combining uncertainty-aware priors from 3D scene graphs with online VLM observations in a real-time tree planner.
A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
cs.RO 2026-04 unverdicted novelty 4.0

A modular VLN architecture builds a cognitive memory graph, decomposes it for VLM reasoning, and solves a weighted traveling repairman problem for context-aware exploration to achieve real-time performance and higher ...
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 26 Pith papers · 6 internal anchors

[1]

SceneNN: A scene meshes dataset with annotations

Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016. 2, 3

work page 2016
[2]

ScanNet: Richly-annotated 3D reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017. 2, 3, 5

work page 2017
[3]

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Matterport3D: Learning from RGB-D data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Fifth International Conference on 3D Vision (3DV), 2017. 2, 3, 5, 13

work page 2017
[5]

Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018. 2, 4, 5, 13

work page 2018
[6]

Habitat: A Platform for Embodied AI Research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019. 2, 3, 4, 5, 6, 7

work page 2019
[7]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-Thor: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

Chalet: Cornell house agent learning environment

Claudia Yan, Dipendra Misra, Andrew Bennnett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018. 3

work page arXiv 2018
[10]

VirtualHome: Simulating household activities via programs

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018

work page 2018
[11]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. arXiv preprint arXiv:2106.14405, 2021. 3

work page arXiv 2021
[12]

Semantic scene completion from a single depth image

Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754, 2017

work page 2017
[13]

2021.doi: 10.48550/arXiv.2011.09127

Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, et al. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics. arXiv preprint arXiv:2011.09127, 2020. 3

work page arXiv 2011
[14]

RoboTHOR: An open simulation-to-real embodied AI platform

Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. RoboTHOR: An open simulation-to-real embodied AI platform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 31...

work page 2020
[15]

The Replica Dataset: A Digital Replica of Indoor Spaces

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 1906
[16]

Rescan: Inductive instance segmentation for indoor RGBD scans

Maciej Halber, Yifei Shi, Kai Xu, and Thomas Funkhouser. Rescan: Inductive instance segmentation for indoor RGBD scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2541–2550, 2019

work page 2019
[17]

RIO: 3D object instance re-localization in changing indoor environments

Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. RIO: 3D object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019. 3

work page 2019
[18]

3D semantic parsing of large-scale indoor spaces

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016. 3, 13 10

work page 2016
[19]

iGibson, a simulation environment for interactive tasks in large realistic scenes

Bokui Shen, Fei Xia, Chengshu Li, Roberto Martın-Martın, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne P Tchapmi, Kent Vainio, Li Fei-Fei, and Silvio Savarese. iGibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint, 2020. 3, 6

work page 2020
[20]

ARKitScenes-a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data, 2021

Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. ARKitScenes-a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data, 2021. URL https://openreview.net/ pdf?id=tjZjv_qh_CE. 3

work page 2021
[21]

https://www.nii.ac.jp/dsc/idr/lifull/

LIFULL HOME. https://www.nii.ac.jp/dsc/idr/lifull/. 3

work page
[22]

Cubicasa5k: A dataset and an improved multi-task model for ﬂoorplan image analysis

Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala. Cubicasa5k: A dataset and an improved multi-task model for ﬂoorplan image analysis. In Scandinavian Conference on Image Analysis, pages 28–40. Springer, 2019

work page 2019
[23]

Data-driven interior plan generation for residential buildings

Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 3

work page 2019
[24]

Zillow indoor dataset: Annotated ﬂoor plans with 360deg panoramas and 3d room layouts

Steve Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang. Zillow indoor dataset: Annotated ﬂoor plans with 360deg panoramas and 3d room layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2133–2143, 2021. 3

work page 2021
[25]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 6

work page 2017
[26]

Demystifying MMD GANs

Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018. 6

work page internal anchor Pith review arXiv 2018
[27]

Cognitive mapping and planning for visual navigation

Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017. 6

work page 2017
[28]

Semi-parametric topological memory for navigation

Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018

work page 2018
[29]

DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames

Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. InInternational Conference on Learning Representations (ICLR), 2020. 7, 8, 13, 19

work page 2020
[30]

Neural topological slam for visual navigation

Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12875–12884, 2020. 6

work page 2020
[31]

Robot navigation in constrained pedestrian environments using reinforcement learning

Claudia Pérez-D’Arpino, Can Liu, Patrick Goebel, Roberto Martín-Martín, and Silvio Savarese. Robot navigation in constrained pedestrian environments using reinforcement learning. arXiv preprint arXiv:2010.08600, 2020. 7

work page arXiv 2010
[32]

Occupancy anticipation for efﬁcient exploration and navigation

Santhosh K Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efﬁcient exploration and navigation. In European Conference on Computer Vision, pages 400–418. Springer, 2020

work page 2020
[33]

Differentiable slam-net: Learning particle slam for visual navigation

Peter Karkus, Shaojun Cai, and David Hsu. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815–2825, 2021. 7

work page 2021
[34]

Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020. 7

work page arXiv 2006
[35]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020

work page arXiv 2010
[36]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018. 7

work page 2018
[37]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7, 13

work page 2016
[38]

LSTM can solve hard long time lag problems

Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. Advances in neural information processing systems, pages 473–479, 1997. 7

work page 1997
[39]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 9

work page 2009
[40]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[41]

Revisiting unreasonable effectiveness of data in deep learning era

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision , pages 843–852, 2017. 11

work page 2017
[42]

Billion-scale semi-supervised learning for image classification

I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classiﬁcation. arXiv preprint arXiv:1905.00546, 2019. 9 6 Acknowledgements We thank all the volunteers who contributed to the dataset curation effort: Harsh Agrawal, Sashank Gondala, Rishabh Jain, Shawn Jiang, Yash Kant, Noah Maestre, ...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[43]

We calculate the normalized histogram of geodesic distances between the start and goal locations for each episode in the train and val splits (independently)

EMD (train, val) measures the dissimilarity between episodes in the train and val splits. We calculate the normalized histogram of geodesic distances between the start and goal locations for each episode in the train and val splits (independently). We then measure the distribution shift between the train and val episodes. This is done by computing the Ear...

work page
[44]

This is calculated as the mean of KID (Gibson real) and KID (MP3D real) from Table 5(b) in the main paper

KID (mean) is a measure of visual ﬁdelity of images rendered from each dataset. This is calculated as the mean of KID (Gibson real) and KID (MP3D real) from Table 5(b) in the main paper

work page
[45]

% defects

% defects is a measure of reconstruction completeness for the 3D scans. For each dataset, this is calculated as the mean of “% defects" values from Figure 4 in the main paper

work page
[46]

It is computed as the overall navigable area in the training scans for each dataset

Navigable area (m2) measures the dataset size. It is computed as the overall navigable area in the training scans for each dataset. We compute the above metrics for all the train datasets6. For a given PointNav val set, we measure the Pearson’s correlation between each of the above metrics for a train dataset and the navigation SPL achieved by agents trai...

work page