Matterport3D: Learning from RGB-D Data in Indoor Environments

Andy Zeng; Angela Dai; Angel Chang; Maciej Halber; Manolis Savva; Matthias Nie{\ss}ner; Shuran Song; Thomas Funkhouser; Yinda Zhang

arxiv: 1709.06158 · v1 · pith:4PGWCMQPnew · submitted 2017-09-18 · 💻 cs.CV

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang , Angela Dai , Thomas Funkhouser , Maciej Halber , Matthias Nie{\ss}ner , Manolis Savva , Shuran Song , Andy Zeng

show 1 more author

Yinda Zhang

This is my paper

classification 💻 cs.CV

keywords rgb-dviewsdatasetsdiversematterport3dpanoramicpredictionsemantic

0 comments

read the original abstract

Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
cs.CV 2026-05 unverdicted novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
cs.CV 2021-11 accept novelty 8.0

ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth
cs.CV 2026-05 unverdicted novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 7.0

PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
Beyond Isolation: A Unified Benchmark for General-Purpose Navigation
cs.RO 2026-05 unverdicted novelty 7.0

OmniNavBench is a unified benchmark for general-purpose navigation featuring composite multi-skill instructions, support for humanoid, quadrupedal and wheeled robots, and 1779 human teleoperated trajectories across 17...
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
cs.CV 2026-04 unverdicted novelty 7.0

Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 7.0

Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.
UniDAC: Universal Metric Depth Estimation for Any Camera
cs.CV 2026-03 unverdicted novelty 7.0

UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Autonomous Frontier-Based Exploration with VLM Guidance
cs.RO 2026-05 unverdicted novelty 6.0

A VLM-based method for selecting exploration frontiers in robotics achieves up to 24% better map coverage than standard geometric heuristics in simulated indoor environments.
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
cs.CV 2026-05 unverdicted novelty 6.0

GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.
UniT: Unified Geometry Learning with Group Autoregressive Transformer
cs.CV 2026-05 unverdicted novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement
cs.GR 2026-05 unverdicted novelty 6.0

VoxScene is a new anchor-conditioned voxel diffusion model that synthesizes collision-free 3D indoor scene arrangements via discrete volumetric occupancies and uses the grids for asset retrieval.
Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.
Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...
HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation
cs.CV 2026-05 unverdicted novelty 6.0

HetScene proposes a two-stage heterogeneous diffusion framework that decomposes scenes into primary structural objects and secondary contextual objects to generate denser, more plausible indoor layouts.
NavOL: Navigation Policy with Online Imitation Learning
cs.RO 2026-05 unverdicted novelty 6.0

NavOL collects expert trajectory labels online from a global planner during policy rollouts in simulation to train a diffusion navigation policy, mitigating distribution shift and improving performance on visual navig...
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
cs.RO 2026-05 unverdicted novelty 6.0

PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 6.0

SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
cs.RO 2026-04 unverdicted novelty 6.0

OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
cs.CV 2026-04 unverdicted novelty 6.0

ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation
cs.RO 2026-03 conditional novelty 6.0

ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.
Memory Over Maps: 3D Object Localization Without Reconstruction
cs.RO 2026-03 unverdicted novelty 6.0

A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
cs.CV 2026-02 unverdicted novelty 6.0

MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.
PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion
cs.CV 2026-01 unverdicted novelty 6.0

PanoSAMic modifies SAM with multi-stage feature encoding, spatio-modal fusion, spherical attention, and dual-view fusion to achieve SOTA panoramic semantic segmentation on public RGB and RGB-D datasets.
AstraNav-World: World Model for Foresight Control and Consistency
cs.CV 2025-12 unverdicted novelty 6.0

AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied n...
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
cs.CV 2025-12 unverdicted novelty 6.0

Chorus pretrains a shared 3D Gaussian scene encoder via multi-teacher distillation to capture holistic features from high-level semantics to fine-grained structure, with strong transfer on segmentation and point-cloud...
DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts
cs.CV 2025-11 unverdicted novelty 6.0

DoReMi uses self-supervised pre-training on topological and texture variations plus domain-aware experts with spatial-guided routing and entropy-controlled allocation to reach 80.1% mIoU on ScanNet and 77.2% mIoU on S3DIS.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
cs.RO 2025-10 unverdicted novelty 6.0

C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...
The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images with Minimal 3D Knowledge
cs.CV 2025-06 unverdicted novelty 6.0

Data-centric novel view synthesis models with minimal 3D knowledge and no pose annotations scale better with data volume and outperform traditional bias-driven methods.
DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding with a Homogeneous Framework
cs.CV 2025-06 unverdicted novelty 6.0

DEGround presents a unified homogeneous framework for 3D visual grounding with shared queries and two plug-in modules for better instruction alignment, reporting a 7.52% improvement on the EmbodiedScan benchmark.
Personalized Embodied Navigation for Portable Object Finding
cs.RO 2024-03 unverdicted novelty 6.0

Transit-Aware Planning (TAP) enriches navigation policies with object transit data on Dynamic Object Maps, raising success rates by 21.1% in MP3D simulation and 18.3% in real-world tests for finding non-stationary targets.
Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction
cs.RO 2026-05 unverdicted novelty 5.0

Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
cs.GR 2026-04 unverdicted novelty 5.0

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
Think before Go: Hierarchical Reasoning for Image-goal Navigation
cs.RO 2026-04 unverdicted novelty 5.0

HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Audio Spatially-Guided Fusion for Audio-Visual Navigation
cs.SD 2026-04 unverdicted novelty 5.0

Audio Spatially-Guided Fusion improves generalization in audio-visual navigation on unheard sound sources by extracting spatial audio features and adaptively fusing them with visual data.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
cs.GR 2026-04 unverdicted novelty 4.0

The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.
MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments
cs.RO 2025-11 unverdicted novelty 4.0

MacroNav learns multi-scale navigation-centric representations through multi-task self-supervised learning and combines them with graph-based reinforcement learning for efficient action selection, reporting gains in s...
To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments
cs.CV 2019-07 unverdicted novelty 4.0

Classical agents outperform learning-based ones on MINOS and Stanford 3D Indoor Spaces, with learned agents weaker at collision avoidance and memory but stronger at handling ambiguity and noise.
Why Build an Assistant in Minecraft?
cs.AI 2019-07 unverdicted novelty 4.0

A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.
NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review (Updated Post-Gaussian Splatting)
cs.CV 2022-10 unverdicted novelty 2.0

A literature survey of NeRF and neural field methods from 2020-2025, organized by architecture and application taxonomies with benchmarks and dataset overviews, covering both pre- and post-Gaussian Splatting periods.