Depth Anything 3: Recovering the Visual Space from Any Views
Pith reviewed 2026-05-11 02:03 UTC · model grok-4.3
The pith
A plain transformer with a single depth-ray target recovers consistent geometry from arbitrary views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Depth Anything 3 uses a single plain transformer encoder and a singular depth-ray prediction target to recover spatially consistent geometry from an arbitrary number of views with or without known poses. Trained via teacher-student distillation exclusively on public academic datasets, the model matches the detail and generalization of Depth Anything 2 while establishing new state-of-the-art results across camera pose estimation, any-view geometry, and visual rendering on a dedicated benchmark.
What carries the argument
A teacher-student training paradigm applied to a vanilla transformer encoder that uses one depth-ray prediction target in place of multi-task heads.
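One way to read the "depth-ray" target is that each pixel carries a depth value together with a ray (an origin and a direction), so a 3D point follows from simple arithmetic. The sketch below assumes the convention `point = origin + depth * direction`; the summary above does not spell out the formula, and the function name and argument layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_depth_ray(
    depths: Sequence[float],
    origins: Sequence[Vec3],
    directions: Sequence[Vec3],
) -> List[Vec3]:
    """Turn per-pixel depth plus per-pixel rays into 3D points.

    Assumed convention (not confirmed by the source):
        point = origin + depth * direction
    """
    if not (len(depths) == len(origins) == len(directions)):
        raise ValueError("depths, origins and directions must have equal length")
    points: List[Vec3] = []
    for depth, origin, direction in zip(depths, origins, directions):
        # Walk `depth` units along the ray starting at its origin.
        x, y, z = (o + depth * d for o, d in zip(origin, direction))
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    # Two toy pixels with hand-picked rays (illustrative values only).
    pts = unproject_depth_ray(
        depths=[2.0, 1.0],
        origins=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        directions=[(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    )
    print(pts)
```

Under this reading, points from different views land in one shared frame as long as the rays are expressed in that frame, which is why a single target could plausibly stand in for separate depth and pose heads.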
If this is right
- The model produces consistent geometry outputs from any number of input views.
- It operates without requiring known camera poses as input.
- It sets new state-of-the-art results on camera pose estimation, any-view geometry, and visual rendering on the paper's own benchmark.
- All training uses only publicly available academic datasets.
- It improves monocular depth estimation accuracy relative to Depth Anything 2.
Where Pith is reading between the lines
- The benchmark introduced here could serve as a standard testbed for future any-view geometry methods.
- Success with a single prediction target may indicate that separate heads for pose and depth are often redundant.
- The approach could extend to video sequences by treating frames as arbitrary views.
Load-bearing premise
A single unmodified transformer encoder plus a singular depth-ray prediction target, trained via teacher-student distillation on public datasets, suffices to produce spatially consistent geometry from arbitrary views without known poses.
What would settle it
A collection of test scenes with many arbitrary views where the model's predicted geometry shows measurable inconsistencies or lower accuracy than models that rely on explicit pose inputs or specialized multi-view fusion modules.
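A "measurable inconsistency" needs a concrete number. One common choice, used here purely as an illustration and not taken from the paper's benchmark, is the symmetric Chamfer distance between the point clouds that two views predict for the same surface, expressed in a shared frame. The function names are hypothetical, and the brute-force search is only suitable for small clouds.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def nearest_distance(point: Vec3, cloud: Sequence[Vec3]) -> float:
    """Euclidean distance from `point` to its closest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def chamfer_distance(cloud_a: Sequence[Vec3], cloud_b: Sequence[Vec3]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return 0.5 * (a_to_b + b_to_a)


if __name__ == "__main__":
    # Toy clouds: the second view disagrees on one point by 0.5 units.
    view_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    view_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
    print(chamfer_distance(view_1, view_2))
```

A value of zero means the two views agree exactly on the sampled points; comparing this number between a pose-free model and a pose-conditioned baseline on the same scenes would be one way to run the test described above.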
Original abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Depth Anything 3 (DA3), a model that recovers spatially consistent 3D geometry from an arbitrary number of input views, with or without known poses, using only a plain transformer encoder (e.g., vanilla DINO) and a single depth-ray prediction target. It is trained with a teacher-student paradigm on public academic datasets and claims detail and generalization on par with Depth Anything 2 (DA2), while outperforming DA2 in monocular depth estimation. It also introduces a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering, on which it reports SOTA results on all tasks and surpasses VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy.
Significance. If the central claims hold under rigorous verification, the result would be significant: it would demonstrate that minimal architectural choices (plain encoder + singular depth-ray target) plus distillation suffice for cross-view geometric consistency, reducing the need for specialized multi-view modules or explicit pose/alignment losses. The new benchmark spanning multiple tasks could become a useful standard for evaluating visual geometry models. Credit is due for the emphasis on public-data-only training and the attempt at architectural simplification.
major comments (3)
- [Abstract and §4, results] The reported average improvements over VGGT of 44.3% in camera pose accuracy and 25.1% in geometric accuracy are presented without error bars, per-task standard deviations, statistical significance tests, or ablation tables isolating the contribution of the depth-ray target versus data correlations. These margins are load-bearing for the SOTA claim, which rests entirely on empirical benchmark results.
- [§3, method] The teacher-student paradigm is described as using only per-view depth rays from a monocular teacher with no explicit cross-view alignment, pose, or consistency losses. It is unclear how the student discovers spatially consistent geometry from arbitrary views, raising the risk that benchmark gains arise from scene correlations in public multi-view datasets rather than the proposed minimal representation.
- [Benchmark definition, likely §4 or appendix] Details on how the new visual geometry benchmark constructs ground truth for arbitrary views, defines the three tasks, ensures no train-test leakage, and handles pose-free inference are insufficient to reproduce or independently validate the cross-task SOTA margins.
minor comments (2)
- [§3] Notation for the depth-ray target and its relation to standard depth or ray representations should be clarified with an equation or diagram in §3 to aid reproducibility.
- The manuscript would benefit from an explicit limitations paragraph discussing failure cases (e.g., highly dynamic scenes or extreme viewpoint changes) where the plain encoder plus depth-ray may break down.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our approach and outlining planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §4, results] The reported average improvements over VGGT of 44.3% in camera pose accuracy and 25.1% in geometric accuracy are presented without error bars, per-task standard deviations, statistical significance tests, or ablation tables isolating the contribution of the depth-ray target versus data correlations. These margins are load-bearing for the SOTA claim, which rests entirely on empirical benchmark results.
  Authors: We agree that error bars, per-task standard deviations, and ablations would provide stronger support for the SOTA claims. In the revised manuscript we will report per-task standard deviations computed across multiple evaluation seeds, include an ablation study that isolates the depth-ray target from other factors, and add statistical significance testing (e.g., paired t-tests) on the key metrics. We note that the large reported margins make the overall ranking robust, but we will incorporate these elements to address the concern directly.
  revision: partial
- Referee: [§3, method] The teacher-student paradigm is described as using only per-view depth rays from a monocular teacher with no explicit cross-view alignment, pose, or consistency losses. It is unclear how the student discovers spatially consistent geometry from arbitrary views, raising the risk that benchmark gains arise from scene correlations in public multi-view datasets rather than the proposed minimal representation.
  Authors: The student processes an arbitrary number of views jointly through the shared vanilla DINO transformer while predicting depth rays; this joint encoding enables the model to discover cross-view geometric consistency implicitly from the multi-view training distribution, even though supervision remains per-view. To address the possibility of scene-specific correlations, the revision will include additional experiments on held-out scenes and cross-dataset generalization that demonstrate the consistency generalizes beyond training-scene statistics.
  revision: yes
- Referee: [Benchmark definition, likely §4 or appendix] Details on how the new visual geometry benchmark constructs ground truth for arbitrary views, defines the three tasks, ensures no train-test leakage, and handles pose-free inference are insufficient to reproduce or independently validate the cross-task SOTA margins.
  Authors: We acknowledge that additional detail is required for full reproducibility. The revised manuscript will expand Section 4 and the appendix with explicit descriptions of ground-truth construction for arbitrary views, precise definitions and metrics for the three tasks, the train-test split procedure used to prevent leakage, and the exact protocol for pose-free inference, including pseudocode for the evaluation pipeline.
  revision: yes
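The paired t-test mentioned in the first response can be sketched in a few lines. The version below assumes each method is scored on the same set of scenes (or seeds), so the scores pair up one-to-one; it returns the t statistic and degrees of freedom and leaves the p-value lookup to a statistics table or library. The function name and the numbers in the demo are hypothetical, not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    scores_a: Sequence[float], scores_b: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic for matched scores of two methods.

    Returns (t, degrees_of_freedom). A large |t| suggests the mean
    per-item difference is unlikely to be zero.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired one-to-one")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = spread / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


if __name__ == "__main__":
    # Made-up per-scene accuracies for two methods (illustration only).
    method_a = [3.0, 5.0, 7.0, 9.0]
    method_b = [1.0, 2.0, 3.0, 4.0]
    t, dof = paired_t_statistic(method_a, method_b)
    print(f"t = {t:.3f}, dof = {dof}")
```

Pairing matters here because scene difficulty varies widely; testing the per-scene differences removes that shared variation, which an unpaired comparison of two averages would not.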
Circularity Check
No circularity: purely empirical claims with no derivations or self-referential predictions
Full rationale
The paper describes an empirical teacher-student training setup using a plain DINO encoder and depth-ray target on public datasets, evaluated via benchmark comparisons (e.g., 44.3% pose accuracy gain over VGGT). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmark results rather than any quantity defined in terms of itself. This is the expected non-finding for an architecture paper without mathematical modeling.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters
axioms (1)
- domain assumption: Teacher-student distillation on public datasets yields spatially consistent multi-view geometry.
Forward citations
Cited by 60 Pith papers
- TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
  TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
- Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
  AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
- PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
  PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
- Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
  Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
- Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
  Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
- AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
  AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
- Face Anything: 4D Face Reconstruction from Any Image Sequence
  A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.
- URoPE: Universal Relative Position Embedding across Geometric Spaces
  URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...
- View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
  A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
- GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
  GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
- Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
  A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
- EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
  EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
  TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
- MoRight: Motion Control Done Right
  MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
- SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
  SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
- AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
  AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...
- MapAnything: Universal Feed-Forward Metric 3D Reconstruction
  MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
- GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
  GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
  UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
- Focusable Monocular Depth Estimation
  FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
- Pixal3D: Pixel-Aligned 3D Generation from Images
  Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- Geometric 4D Stitching for Grounded 4D Generation
  Geometric 4D Stitching explicitly complements missing geometric regions in 4D generated scenes with grounded stitches to achieve consistent 4D representations in under 10 minutes on a single GPU.
- Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
  Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
- Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
  Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
- AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
  AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with d...
- 3D-ReGen: A Unified 3D Geometry Regeneration Framework
  3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
  MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
- MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
  MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- Image Generators are Generalist Vision Learners
  Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- FurnSet: Exploiting Repeats for 3D Scene Reconstruction
  FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...
- Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
  A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth.
- VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
  VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
- Geometric Context Transformer for Streaming 3D Reconstruction
  LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- Lyra 2.0: Explorable Generative 3D Worlds
  Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
- Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions
  GraG reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work by using compact Sum-of-Gaussians tracking initialized from large models and refined with 2D losses.
- Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
  A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
- Self-Improving 4D Perception via Self-Distillation
  SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
- PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
  PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
- LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
  LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
- LoMa: Local Feature Matching Revisited
  Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.
- Memory Over Maps: 3D Object Localization Without Reconstruction
  A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
  Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
- Syn4D: A Multiview Synthetic 4D Dataset
  Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
- Context Unrolling in Omni Models
  Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
- MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
  MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
Reference graph
Works this paper leans on
-
[1]
Large-scale data for multiple-view stereopsis.Int
Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis.Int. J. Comput. Vis., 120(2):153–168, 2016
work page 2016
-
[2]
Yousset I Abdel-Aziz, Hauck Michael Karara, and Michael Hauck. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry.Photogrammetric engineering & remote sensing, 81(2):103–107, 2015
work page 2015
-
[3]
Map-free visual relocalization: Metric pose relative to a single image
Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022
work page 2022
-
[4]
Martha E Arterberry and Albert Yonas. Perception of three-dimensional shape specified by optic flow by 8-week-old infants.Perception & Psychophysics, 62(3):550–556, 2000
work page 2000
-
[5]
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.Adv. Neural Inform. Process. Syst., 2021
work page 2021
-
[6]
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Aleksei Bochkovskii, AmaÃG, l Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXivpreprintarXiv:2410.02073, 2024
work page internal anchor Pith review arXiv 2024
-
[7]
Depth pro: Sharp monocular metric depth in less than a second
Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InInt. Conf. Learn. Represent., 2025
work page 2025
-
[8]
Unstructured lumigraph rendering
Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. InProceedings of the 28th annual conference on Computer graphics and interactivetechniques, pages 425–432, 2001
work page 2001
-
[9]
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2, 2020
work page 2020
-
[10]
Must3r: Multi-view network for stereo 3d reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. Must3r: Multi-view network for stereo 3d reconstruction. InIEEE Conf. Comput. Vis. Pattern Recog., pages 1050–1060, 2025
work page 2025
-
[11]
pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction
David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InIEEE Conf. Comput. Vis. Pattern Recog., pages 19457–19467, 2024
work page 2024
-
[12]
Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo
Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InIEEE Conf. Comput. Vis. Pattern Recog., pages 14124–14133, 2021
work page 2021
-
[13]
Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images
Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEur. Conf. Comput. Vis., pages 370–386. Springer, 2024
work page 2024
-
[14]
Explicit correspondence matching for generalizable neural radiance fields.IEEE Trans.Pattern Anal
Yuedong Chen, Haofei Xu, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. Explicit correspondence matching for generalizable neural radiance fields.IEEE Trans.Pattern Anal. Mach. Intell., 2025
work page 2025
-
[15]
Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes
Jaehoon Cho, Dongbo Min, Youngjung Kim, and Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes.arXiv preprint arXiv:2110.11590, 2021
-
[16]
The cityscapes dataset for semantic urban scene understanding
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016
work page 2016
-
[17]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023
work page 2023
-
[18]
Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, and Xiaoyang Guo. Sail-recon: Large sfm by augmenting scene regression with localization.arXiv preprint arXiv:2508.17972, 2025. 24
-
[19]
Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences
Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.11539, 2025
-
[20]
Superpoint: Self-supervised interest point detection and description
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR workshops, pages 224–236, 2018
work page 2018
-
[21]
Drone australia gliding ep025: Sydney views | opera house, harbour bridge & hyde park | dji mavic 4k
MTS Drones. Drone australia gliding ep025: Sydney views | opera house, harbour bridge & hyde park | dji mavic 4k. https://www.youtube.com/watch?v=qbgKDaGraTA, 2024. Accessed: Sep. 25, 2025. Used under YouTube Standard License
work page 2024
-
[22]
D2-net: A trainable cnn for joint description and detection of local features
Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8092–8101, 2019
work page 2019
-
[23]
Depth map prediction from a single image using a multi-scale deep network
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Adv. Neural Inform. Process. Syst., 2014
work page 2014
-
[24]
Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021
work page 2021
-
[26]
Vision meets robotics: The kitti dataset
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013
work page 2013
-
[27]
Yotam Gil, Shay Elmalem, Harel Haim, Emanuel Marom, and Raja Giryes. Online training of stereo self-calibration using monocular depth estimation. IEEE Transactions on Computational Imaging, 7:812–823, 2021
work page 2021
-
[28]
Radiant foam: Real-time differentiable ray tracing
Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, and Andrea Tagliasacchi. Radiant foam: Real-time differentiable ray tracing. In Int. Conf. Comput. Vis., 2025
work page 2025
-
[29]
Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5354–5363, 2024
work page 2024
-
[30]
3d packing for self-supervised monocular depth estimation
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., 2020
work page 2020
-
[31]
Multi-view reconstruction via sfm-guided monocular depth estimation
Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, and Hujun Bao. Multi-view reconstruction via sfm-guided monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5272–5282, 2025
work page 2025
-
[32]
All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes
Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnés Borràs, Mario Noriega, German Ros, Jose A. Iglesias-Guitian, and Antonio M. López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 637:130038, 2025. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2025.130038. URL https://www.sciencedir...
-
[33]
Detector-free structure from motion
Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free structure from motion. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21594–21603, 2024
work page 2024
-
[34]
Plenoptic modeling and rendering from image sequences taken by a hand-held camera
Benno Heigl, Reinhard Koch, Marc Pollefeys, Joachim Denzler, and Luc Van Gool. Plenoptic modeling and rendering from image sequences taken by a hand-held camera. In Mustererkennung 1999: 21. DAGM-Symposium Bonn, 15.–17. September 1999, pages 94–101. Springer, 1999
work page 1999
-
[35]
Lrm: Large reconstruction model for single image to 3d
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In Int. Conf. Learn. Represent., 2024
work page 2024
-
[36]
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
work page 2024
-
[37]
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 2024
work page 2024
-
[38]
Deepmvs: Learning multi-view stereopsis
Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
work page 2018
-
[39]
Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors
Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1071–1081, 2025
work page 2025
-
[40]
Megasynth: Scaling up 3d scene reconstruction with synthesized data
Hanwen Jiang, Zexiang Xu, Desai Xie, Ziwen Chen, Haian Jin, Fujun Luan, Zhixin Shu, Kai Zhang, Sai Bi, Xin Sun, et al. Megasynth: Scaling up 3d scene reconstruction with synthesized data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16441–16452, 2025
work page 2025
-
[41]
Anysplat: Feed-forward 3d gaussian splatting from unconstrained views
Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Trans. Graph., 2025
work page 2025
-
[42]
Repurposing diffusion-based image generators for monocular depth estimation
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, pages 9492–9502, 2024
work page 2024
-
[43]
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction, 202...
work page internal anchor Pith review arXiv 2025
-
[44]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023
work page 2023
-
[46]
Tanks and temples: Benchmarking large-scale scene reconstruction
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph., 36(4):1–13, 2017
work page 2017
-
[47]
Eden: Multimodal synthetic dataset of enclosed garden scenes
Hoang-An Le, Thomas Mensink, Partha Das, Sezer Karaoglu, and Theo Gevers. Eden: Multimodal synthetic dataset of enclosed garden scenes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1579–1589, 2021
work page 2021
-
[48]
Grounding image matching in 3d with mast3r
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In Eur. Conf. Comput. Vis., pages 71–91. Springer, 2024
work page 2024
-
[49]
Light field rendering
Marc Levoy and Pat Hanrahan. Light field rendering. ACM Trans. Graph., 1996
work page 1996
-
[50]
Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond
Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023
work page 2023
-
[51]
Megadepth: Learning single-view depth prediction from internet photos
Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2041–2050, 2018
work page 2018
-
[52]
Efficient neural radiance fields for interactive free-viewpoint video
Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022
work page 2022
-
[53]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In IEEE Conf. Comput. Vis. Pattern Recog., pages 22160–22169, 2024
work page 2024
-
[54]
Vggt-slam: Dense rgb slam optimized on the sl(4) manifold
Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025
-
[55]
John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In Proceedings of the IEEE International Conference on Computer Vision, pages 2678–2687, 2017
work page 2017
-
[56]
Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo
Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4981–4991, 2023
work page 2023
-
[57]
Nerf: Representing scenes as neural radiance fields for view synthesis
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Eur. Conf. Comput. Vis., 2020
work page 2020
-
[58]
Orb-slam: A versatile and accurate monocular slam system
Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015
work page 2015
-
[59]
Mast3r-slam: Real-time dense slam with 3d reconstruction priors
Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 16695–16705, 2025
work page 2025
-
[60]
3d ken burns effect from a single image
Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3d ken burns effect from a single image. ACM Transactions on Graphics (ToG), 38(6):1–15, 2019
work page 2019
-
[61]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Global structure-from-motion revisited
Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In Eur. Conf. Comput. Vis., pages 58–77. Springer, 2024
work page 2024
-
[63]
Aria digital twin: A new benchmark dataset for egocentric 3d machine perception
Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023
work page 2023
-
[64]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[65]
Unidepth: Universal monocular metric depth estimation
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10106–10116, 2024
work page 2024
-
[66]
Unidepthv2: Universal monocular metric depth estimation made simpler
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025
-
[67]
Vision transformers for dense prediction
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Int. Conf. Comput. Vis., pages 12179–12188, 2021
work page 2021
-
[68]
Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Int. Conf. Comput. Vis., pages 10901–10911, 2021
work page 2021
-
[69]
Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding
Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021
work page 2021
-
[70]
Structure-from-motion revisited
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., 2016
work page 2016
-
[71]
Pixelwise view selection for unstructured multi-view stereo
Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., 2016
work page 2016
-
[72]
A multi-view stereo benchmark with high-resolution images and multi-camera videos
Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3260–3269, 2017
work page 2017
-
[73]
A comparison and evaluation of multi-view stereo reconstruction algorithms
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conf. Comput. Vis. Pattern Recog., volume 1, pages 519–528. IEEE, 2006
work page 2006
-
[74]
Scene coordinate regression forests for camera relocalization in rgb-d images
Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2930–2937, 2013
work page 2013
-
[75]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Eur. Conf. Comput. Vis., pages 746–760. Springer, 2012
work page 2012
-
[76]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Eur. Conf. Comput. Vis., pages 746–760. Springer, 2012
work page 2012
-
[77]
Scene representation networks: Continuous 3d-structure-aware neural scene representations
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. Adv. Neural Inform. Process. Syst., 32, 2019
work page 2019
-
[78]
Light field networks: Neural scene representations with single-evaluation rendering
Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Adv. Neural Inform. Process. Syst., 34:19313–19325, 2021
work page 2021
-
[79]
Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024
-
[80]
Photo tourism: exploring photo collections in 3d
Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. ACM Trans. Graph., pages 835–846, 2006
work page 2006
-
[81]
Sun rgb-d: A rgb-d scene understanding benchmark suite
Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In IEEE Conf. Comput. Vis. Pattern Recog., pages 567–576, 2015
work page 2015
-
[83]
The Replica Dataset: A Digital Replica of Indoor Spaces
Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019
work page internal anchor Pith review arXiv 1906