Recognition: 2 theorem links
· Lean TheoremHabitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Pith reviewed 2026-05-14 18:18 UTC · model grok-4.3
The pith
HM3D dataset of 1000 real indoor 3D scenes produces PointGoal navigation agents that achieve top performance on HM3D, Gibson, and MP3D evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HM3D is pareto optimal in the sense that agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100 percent performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.
What carries the argument
The HM3D collection of 1000 textured 3D mesh reconstructions of diverse real indoor spaces that supplies greater scale, completeness, and visual fidelity for embodied agent training.
If this is right
- Embodied AI training pipelines can shift to HM3D as the primary source of environments because it yields superior agents on every tested benchmark.
- Smaller datasets such as Gibson may reach saturation and become unnecessary for evaluation once agents achieve 100 percent success.
- Increased scene diversity and fidelity in training data directly improves generalization of navigation policies across different indoor layouts.
- Research on more complex embodied tasks can now leverage the larger navigable area and higher visual quality without immediate performance plateaus.
Where Pith is reading between the lines
- Future dataset construction for embodied AI should prioritize physical scale and surface completeness over other design choices to achieve cross-benchmark dominance.
- The high visual fidelity may shorten the sim-to-real transfer gap when policies trained in HM3D are deployed on physical robots.
- Benchmark suites could evolve to include cross-training evaluations as a standard test of dataset quality.
- Larger environments open the possibility of studying long-horizon tasks that require agents to traverse multiple floors or visit distant rooms.
Load-bearing premise
The performance advantage of HM3D-trained agents arises primarily from the dataset's larger scale, reconstruction completeness, and visual fidelity rather than differences in training procedures or evaluation protocols.
What would settle it
Train identical PointGoal navigation agents on HM3D and on Gibson using the exact same procedure, then measure whether the HM3D-trained agents fail to exceed the Gibson-trained agents when both are evaluated on Gibson and MP3D test sets.
read the original abstract
We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is `pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Habitat-Matterport 3D (HM3D) dataset of 1,000 building-scale 3D reconstructions from diverse real-world indoor locations. It claims HM3D surpasses prior datasets (MP3D, Gibson, Replica, ScanNet) in physical scale (112.5k m² navigable space, 1.4-3.7× larger), visual fidelity (20-85% higher w.r.t. real-camera images), and reconstruction completeness (34-91% fewer artifacts). The central empirical result is that PointGoal navigation agents trained on HM3D achieve the highest performance regardless of evaluation on HM3D, Gibson, or MP3D test sets, including 100% success on Gibson-test, making HM3D 'pareto optimal' with no analogous claim possible for other datasets.
Significance. If the performance gains hold under matched training conditions, HM3D supplies a substantially larger and higher-fidelity resource that could become the default training and evaluation environment for embodied AI, enabling more robust policies and potentially retiring smaller benchmarks such as Gibson. The direct cross-dataset comparisons and quantitative fidelity metrics constitute a concrete contribution that strengthens the empirical foundation of the field.
major comments (1)
- [PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.
minor comments (1)
- [Abstract] Abstract: the statement '100% performance on Gibson-test dataset' should specify the exact metric (success rate, SPL, etc.) and any evaluation conditions.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below.
read point-by-point responses
-
Referee: [PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.
Authors: We thank the referee for this observation. The training protocols were identical across HM3D, Gibson, and MP3D: the same hyperparameters were used for all runs, the same total number of training steps was performed, and episodes were sampled uniformly at random from the training scenes of each dataset. To ensure a fair comparison given the differing navigable areas, we matched the number of training episodes across datasets by subsampling from the larger ones (HM3D and Gibson) to equal the episode count available from the smallest dataset. This controlled for data volume, so that performance differences can be attributed to scale, fidelity, and completeness. The manuscript describes the shared experimental setup in Section 4, but we agree it would benefit from greater explicitness. We will revise the paper to add a dedicated paragraph and summary table confirming the matched episode counts, steps, sampling strategy, and hyperparameters. revision: yes
Circularity Check
No circularity; empirical dataset paper with direct experimental comparisons
full rationale
The paper introduces the HM3D dataset and supports its 'pareto optimal' claim via reported PointNav training and cross-evaluation results on HM3D, Gibson, and MP3D. No equations, parameter fits, or derivations appear in the provided text. The performance claim is an empirical observation from agent training runs, not a reduction to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Training protocol details are not shown to collapse into the dataset properties by construction. This is a standard dataset contribution whose central assertions rest on external experimental outcomes rather than internal redefinition.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 26 Pith papers
-
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
-
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
InHabit generates 78K photorealistic 3D human-scene interaction samples across 800 scenes by rendering scenes, using foundation models to propose actions and insert humans, then optimizing to SMPL-X bodies, improving ...
-
Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search
SAGR builds a semantic area graph from occupancy maps so LLMs can assign rooms to robots for language-guided search, staying competitive with standard exploration while improving semantic target finding by up to 18.8%...
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
UniDAC: Universal Metric Depth Estimation for Any Camera
UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.
-
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
-
When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
LongAct benchmark reveals top VLMs reach only 59% goal completion and 16% full success on long-horizon household tasks, while HoloMind agent improves results via DAG planner, multimodal spatial memory, episodic memory...
-
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
-
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
-
OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
FSUNav's dual brain-inspired modules achieve state-of-the-art zero-shot goal navigation across heterogeneous robots with improved speed, safety, and generalization.
-
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...
-
ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation
ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.
-
Memory Over Maps: 3D Object Localization Without Reconstruction
A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...
-
Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation
A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.
-
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
TrajRAG uses a topological-polar trajectory representation and hierarchical retrieval to accumulate and reuse geometric-semantic navigation experiences, improving zero-shot ObjectNav on MP3D and HM3D benchmarks.
-
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
-
Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.
-
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
-
Think before Go: Hierarchical Reasoning for Image-goal Navigation
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
-
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
-
IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
IGV-RRT improves object goal navigation in dynamic indoor environments by combining uncertainty-aware priors from 3D scene graphs with online VLM observations in a real-time tree planner.
-
A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
A modular VLN architecture builds a cognitive memory graph, decomposes it for VLM reasoning, and solves a weighted traveling repairman problem for context-aware exploration to achieve real-time performance and higher ...
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
Reference graph
Works this paper leans on
-
[1]
SceneNN: A scene meshes dataset with annotations
Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016. 2, 3
work page 2016
-
[2]
ScanNet: Richly-annotated 3D reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017. 2, 3, 5
work page 2017
-
[3]
Joint 2D-3D-Semantic Data for Indoor Scene Understanding
Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[4]
Matterport3D: Learning from RGB-D data in indoor environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Fifth International Conference on 3D Vision (3DV), 2017. 2, 3, 5, 13
work page 2017
-
[5]
Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese
Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018. 2, 4, 5, 13
work page 2018
-
[6]
Habitat: A Platform for Embodied AI Research
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019. 2, 3, 4, 5, 6, 7
work page 2019
-
[7]
On Evaluation of Embodied Navigation Agents
Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018. 2, 7
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[8]
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-Thor: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017. 3
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[9]
Chalet: Cornell house agent learning environment
Claudia Yan, Dipendra Misra, Andrew Bennnett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018. 3
-
[10]
VirtualHome: Simulating household activities via programs
Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018
work page 2018
-
[11]
Habitat 2.0: Training home assistants to rearrange their habitat
Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. arXiv preprint arXiv:2106.14405, 2021. 3
-
[12]
Semantic scene completion from a single depth image
Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754, 2017
work page 2017
-
[13]
2021.doi: 10.48550/arXiv.2011.09127
Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, et al. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics. arXiv preprint arXiv:2011.09127, 2020. 3
-
[14]
RoboTHOR: An open simulation-to-real embodied AI platform
Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. RoboTHOR: An open simulation-to-real embodied AI platform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 31...
work page 2020
-
[15]
The Replica Dataset: A Digital Replica of Indoor Spaces
Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[16]
Rescan: Inductive instance segmentation for indoor RGBD scans
Maciej Halber, Yifei Shi, Kai Xu, and Thomas Funkhouser. Rescan: Inductive instance segmentation for indoor RGBD scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2541–2550, 2019
work page 2019
-
[17]
RIO: 3D object instance re-localization in changing indoor environments
Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. RIO: 3D object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019. 3
work page 2019
-
[18]
3D semantic parsing of large-scale indoor spaces
Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016. 3, 13 10
work page 2016
-
[19]
iGibson, a simulation environment for interactive tasks in large realistic scenes
Bokui Shen, Fei Xia, Chengshu Li, Roberto Martın-Martın, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne P Tchapmi, Kent Vainio, Li Fei-Fei, and Silvio Savarese. iGibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint, 2020. 3, 6
work page 2020
-
[20]
Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. ARKitScenes-a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data, 2021. URL https://openreview.net/ pdf?id=tjZjv_qh_CE. 3
work page 2021
-
[21]
https://www.nii.ac.jp/dsc/idr/lifull/
LIFULL HOME. https://www.nii.ac.jp/dsc/idr/lifull/. 3
-
[22]
Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis
Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala. Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis. In Scandinavian Conference on Image Analysis, pages 28–40. Springer, 2019
work page 2019
-
[23]
Data-driven interior plan generation for residential buildings
Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 3
work page 2019
-
[24]
Zillow indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts
Steve Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang. Zillow indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2133–2143, 2021. 3
work page 2021
-
[25]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 6
work page 2017
-
[26]
Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018. 6
work page internal anchor Pith review arXiv 2018
-
[27]
Cognitive mapping and planning for visual navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017. 6
work page 2017
-
[28]
Semi-parametric topological memory for navigation
Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018
work page 2018
-
[29]
DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. InInternational Conference on Learning Representations (ICLR), 2020. 7, 8, 13, 19
work page 2020
-
[30]
Neural topological slam for visual navigation
Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12875–12884, 2020. 6
work page 2020
-
[31]
Robot navigation in constrained pedestrian environments using reinforcement learning
Claudia Pérez-D’Arpino, Can Liu, Patrick Goebel, Roberto Martín-Martín, and Silvio Savarese. Robot navigation in constrained pedestrian environments using reinforcement learning. arXiv preprint arXiv:2010.08600, 2020. 7
-
[32]
Occupancy anticipation for efficient exploration and navigation
Santhosh K Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. In European Conference on Computer Vision, pages 400–418. Springer, 2020
work page 2020
-
[33]
Differentiable slam-net: Learning particle slam for visual navigation
Peter Karkus, Shaojun Cai, and David Hsu. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815–2825, 2021. 7
work page 2021
-
[34]
Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020. 7
-
[35]
Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020
-
[36]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018. 7
work page 2018
-
[37]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7, 13
work page 2016
-
[38]
LSTM can solve hard long time lag problems
Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. Advances in neural information processing systems, pages 473–479, 1997. 7
work page 1997
-
[39]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 9
work page 2009
-
[40]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014
work page 2014
-
[41]
Revisiting unreasonable effectiveness of data in deep learning era
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision , pages 843–852, 2017. 11
work page 2017
-
[42]
Billion-scale semi-supervised learning for image classification
I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019. 9 6 Acknowledgements We thank all the volunteers who contributed to the dataset curation effort: Harsh Agrawal, Sashank Gondala, Rishabh Jain, Shawn Jiang, Yash Kant, Noah Maestre, ...
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[43]
EMD (train, val) measures the dissimilarity between episodes in the train and val splits. We calculate the normalized histogram of geodesic distances between the start and goal locations for each episode in the train and val splits (independently). We then measure the distribution shift between the train and val episodes. This is done by computing the Ear...
-
[44]
KID (mean) is a measure of visual fidelity of images rendered from each dataset. This is calculated as the mean of KID (Gibson real) and KID (MP3D real) from Table 5(b) in the main paper
- [45]
-
[46]
It is computed as the overall navigable area in the training scans for each dataset
Navigable area (m2) measures the dataset size. It is computed as the overall navigable area in the training scans for each dataset. We compute the above metrics for all the train datasets6. For a given PointNav val set, we measure the Pearson’s correlation between each of the above metrics for a train dataset and the navigation SPL achieved by agents trai...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.