FUSER is the first feed-forward multiview 3D registration transformer that jointly processes all scans to predict global poses, followed by SE(3)^N diffusion refinement for higher accuracy.
hub Baseline reference
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Baseline reference. 53% of citing Pith papers use this work as a benchmark or comparison.
abstract
Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can now impact people's everyday experiences. However, transforming these scene understanding methods to real-world experiences requires additional innovation and development. In this paper we introduce ARKitScenes. It is not only the first RGB-D dataset that is captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released. In addition to the raw and processed data from the mobile device, ARKitScenes includes high resolution depth maps captured using a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for a large taxonomy of furniture. We further analyze the usefulness of the data for two downstream tasks: 3D object detection and color-guided depth upsampling. We demonstrate that our dataset can help push the boundaries of existing state-of-the-art methods and it introduces new challenges that better represent real-world scenarios.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
DENALI is the first large-scale real-world dataset of space-time histograms from low-cost LiDARs for training models to perceive hidden objects via multi-bounce light cues.
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
FastForward represents scenes as collections of 3D-anchored image features and performs camera pose estimation via feed-forward correspondence prediction, achieving competitive accuracy with minimal mapping time.
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and dense reconstruction benchmarks.
ForeSplat introduces MetaGrad, a meta-gradient training rule that makes feed-forward 3DGS predictions optimization-aware so they converge faster and to higher quality upon refinement.
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency trade-offs on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.
Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
citing papers explorer
-
FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
FUSER is the first feed-forward multiview 3D registration transformer that jointly processes all scans to predict global poses, followed by SE(3)^N diffusion refinement for higher accuracy.
-
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
-
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
-
DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
DENALI is the first large-scale real-world dataset of space-time histograms from low-cost LiDARs for training models to perceive hidden objects via multi-bounce light cues.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
WildDet3D: Scaling Promptable 3D Detection in the Wild
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
-
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence
VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.
-
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
-
POMA-3D: The Point Map Way to 3D Scene Understanding
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
-
A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
FastForward represents scenes as collections of 3D-anchored image features and performs camera pose estimation via feed-forward correspondence prediction, achieving competitive accuracy with minimal mapping time.
-
$\pi^3$: Permutation-Equivariant Visual Geometry Learning
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and dense reconstruction benchmarks.
-
ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting
ForeSplat introduces MetaGrad, a meta-gradient training rule that makes feed-forward 3DGS predictions optimization-aware so they converge faster and to higher quality upon refinement.
-
UniT: Unified Geometry Learning with Group Autoregressive Transformer
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
-
Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection
HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency trade-offs on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.
-
HSG: Hyperbolic Scene Graph
Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D
BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.
-
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
-
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
-
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.
-
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
-
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Chorus pretrains a shared 3D Gaussian scene encoder via multi-teacher distillation to capture holistic features from high-level semantics to fine-grained structure, with strong transfer on segmentation and point-cloud tasks using far fewer scenes.
-
DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts
DoReMi uses self-supervised pre-training on topological and texture variations plus domain-aware experts with spatial-guided routing and entropy-controlled allocation to reach 80.1% mIoU on ScanNet and 77.2% mIoU on S3DIS.
-
Streaming 4D Visual Geometry Transformer
A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.
-
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
-
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
-
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
IVGT implicitly models continuous neural scene representations from pose-free multi-view images to enable coherent surface extraction, novel view synthesis, and related 3D tasks via SDF and color prediction.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
-
ViPE: Video Pose Engine for 3D Geometric Perception
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.