UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

Christos Sakaridis; Luc Van Gool; Luigi Piccinelli; Mattia Segu; Siyuan Li; Wim Abbeloos; Yung-Hsu Yang

arXiv: 2502.20110 · v2 · pith:T7ZZFOUO · submitted 2025-02-27 · cs.CV

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, Luc Van Gool

Reviewed by Pith · 2026-05-17 09:05 UTC · grok-4.3 · pith:T7ZZFOUO

classification cs.CV

keywords monocular metric depth estimation · zero-shot generalization · metric 3D reconstruction · camera-depth disentanglement · edge-guided loss · uncertainty estimation · universal depth model

The pith

UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UniDepthV2 as a model for monocular metric depth estimation that works universally rather than being limited to the domains seen during training. It achieves this by directly outputting metric 3D points from one image at inference time, using a self-promptable camera module to supply dense camera information that conditions the depth features. A pseudo-spherical representation helps separate camera geometry from depth values, while a geometric invariance loss keeps the depth features stable under camera changes. An edge-guided loss sharpens boundaries in the output, the architecture is simplified for efficiency, and an uncertainty map is added for downstream use. Zero-shot tests across ten datasets show the model maintains accuracy where prior methods degrade.

Core claim

UniDepthV2 reconstructs metric 3D scenes from single images across domains by implementing a self-promptable camera module that predicts a dense camera representation to condition depth features, exploiting a pseudo-spherical output representation that disentangles the camera and depth representations, and proposing a geometric invariance loss that promotes the invariance of camera-prompted depth features. The model further improves its predecessor through an edge-guided loss for sharper localization, a revisited and simplified architectural design, and an additional uncertainty-level output.

What carries the argument

A self-promptable camera module that predicts a dense camera representation to condition depth features, combined with a pseudo-spherical output representation and a geometric invariance loss.
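To make the disentanglement idea concrete, here is a minimal sketch of one possible pseudo-spherical encoding. It assumes a pinhole camera and an (azimuth, elevation, log-depth) parameterization in which the third component is the logarithm of the z-depth. The function names, the pinhole model, and the angle convention are illustrative assumptions, not the paper's definitions or its released code.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth `depth` to a 3D point (assumed pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def to_pseudo_spherical(point):
    """Encode a 3D point as (azimuth, elevation, log z-depth). Requires z > 0."""
    x, y, z = point
    azimuth = math.atan2(x, z)                   # angle in the x-z plane
    elevation = math.atan2(y, math.hypot(x, z))  # angle above the x-z plane
    return (azimuth, elevation, math.log(z))


def from_pseudo_spherical(azimuth, elevation, log_depth):
    """Decode back to a 3D point. Valid for rays in front of the camera."""
    z = math.exp(log_depth)
    x = z * math.tan(azimuth)
    y = z * math.tan(elevation) / math.cos(azimuth)
    return (x, y, z)


# Hypothetical intrinsics and pixel, chosen only for illustration.
near = to_pseudo_spherical(backproject(400.0, 300.0, 2.5, 500.0, 500.0, 320.0, 240.0))
far = to_pseudo_spherical(backproject(400.0, 300.0, 10.0, 500.0, 500.0, 320.0, 240.0))
```

In this sketch the two angular components of `near` and `far` coincide, because they depend only on the pixel and the intrinsics, while the third component carries the depth alone. That separation is the property the summary above attributes to the pseudo-spherical output.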

If this is right

Metric 3D reconstruction becomes feasible from ordinary single images in previously unseen environments.
Downstream tasks gain access to per-pixel uncertainty estimates for confidence-aware processing.
Edge localization in depth maps improves without requiring separate post-processing steps.
A single trained model replaces the need for domain-specific depth estimators in many settings.
Inference runs with a simpler and more efficient network than the prior version.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disentanglement of geometry and appearance could be tested on related tasks such as surface normal estimation or novel-view synthesis.
If the uncertainty output correlates well with actual error, it might support active sensing strategies that query only low-confidence regions.
The approach opens the possibility of deploying depth-aware systems in robotics or augmented reality without per-scene camera recalibration.
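Whether the uncertainty output "correlates well with actual error" can be checked with a rank correlation between per-pixel uncertainty and per-pixel absolute error. The sketch below implements Spearman's rank correlation in plain Python. The input lists are hypothetical flattened per-pixel values, and nothing here is taken from the paper's evaluation code.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def spearman(uncertainty, abs_error):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(uncertainty), ranks(abs_error))


# Hypothetical values: uncertainty orders the pixels exactly as the error does.
rho = spearman([0.1, 0.4, 0.2, 0.9], [0.05, 0.30, 0.10, 1.20])
```

A value of `rho` near 1 would mean high-uncertainty pixels are also the high-error pixels, which is the precondition for the active-sensing idea above. A value near 0 would mean the uncertainty map is not informative about error.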

Load-bearing premise

The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.

What would settle it

A new test domain where the model produces depth values that deviate systematically from ground-truth metric distances without any fine-tuning or camera calibration would falsify the claim of universal generalization.
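A systematic deviation of this kind would show up in standard depth metrics. The sketch below computes the absolute relative error, the δ1 accuracy (fraction of pixels with max(pred/gt, gt/pred) < 1.25), and the median prediction-to-ground-truth ratio as a simple scale-bias indicator. The function names and the sample numbers are illustrative assumptions, and the paper's own evaluation protocol may differ.

```python
import statistics


def valid_pairs(pred, gt):
    """Keep only pixels with positive ground truth and positive prediction."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    return pairs


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels."""
    pairs = valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio is within the threshold."""
    pairs = valid_pairs(pred, gt)
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


def median_scale_ratio(pred, gt):
    """Median of pred / gt; values far from 1 indicate a systematic scale bias."""
    return statistics.median(p / g for p, g in valid_pairs(pred, gt))


# Hypothetical predictions that overshoot by 10%; the last pixel has no ground truth.
pred = [2.2, 4.4, 11.0, 5.0]
gt = [2.0, 4.0, 10.0, 0.0]
bias = median_scale_ratio(pred, gt)
```

In this toy case every valid pixel is 10% too far, so `bias` sits near 1.1 even though δ1 is perfect. A median ratio that stays away from 1 across a whole new domain is the pattern that would count against the universality claim.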

Original abstract

Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

UniDepthV2 refines the prior model with an edge-guided loss, uncertainty head, and simplifications, delivering reported zero-shot gains, but the camera module's ability to truly disentangle scale from scene content remains the key unproven piece.

UniDepthV2 builds directly on the first version by adding an edge-guided loss for sharper boundaries, an uncertainty output, and a cleaner architecture, while keeping the self-promptable camera module, the pseudo-spherical representation, and the geometric invariance loss. The result is a model that claims stronger zero-shot metric depth across ten datasets without test-time adjustments or extra inputs. The code release helps anyone who wants to check the numbers themselves. The practical upside is clear for robotics or reconstruction work that needs metric scale out of the box on new domains. The main soft spot is whether the camera module actually recovers intrinsics independently or simply correlates image statistics with scale. The geometric invariance loss is intended to enforce separation, but if domain cues leak through that separation, the zero-shot story weakens on cameras or scenes far from training. The evaluations are broad, yet the paper would be stronger with explicit checks on focal-length shifts or sensor changes that the stress test flags. This work suits readers already following monocular depth papers who need an incremental but usable step forward. It is coherent enough on its own terms to go to referees rather than be desk rejected, with the main questions centered on how well the disentanglement holds up under distribution shift.

Referee Report

3 major / 2 minor

Summary. The paper introduces UniDepthV2, a model for monocular metric depth estimation (MMDE) that directly predicts metric 3D points from single images across domains without additional inputs. It builds on UniDepth with a self-promptable camera module that outputs a dense camera representation, a pseudo-spherical output representation to disentangle camera and depth features, a geometric invariance loss, an edge-guided loss for improved edge sharpness, a simplified architecture, and an uncertainty output. The central claim is superior zero-shot generalization on ten depth datasets.

Significance. If the disentanglement of camera parameters from scene content holds and zero-shot results are free of domain leakage, this would represent a meaningful step toward practical universal MMDE for downstream 3D tasks. The public release of code and models strengthens the contribution by enabling direct reproducibility and extension.

major comments (3)

[§3.2] Self-promptable camera module: The module is described as predicting a dense camera representation to condition depth features, but the manuscript lacks controlled experiments (e.g., varying focal length or principal point while holding scene content fixed) that would demonstrate whether predictions rely on true camera intrinsics rather than image appearance cues such as object scale or texture. This directly bears on the zero-shot generalization claim.
[§4] Experimental setup and Table 1/2: The zero-shot evaluations across ten datasets report superior performance, yet the text does not explicitly confirm that training data splits exclude any overlap with evaluation domains or that baselines were re-trained under identical conditions without implicit domain cues. This information is load-bearing for the generalization superiority assertion.
[§3.3] Geometric invariance loss: The loss is introduced to enforce invariance of camera-prompted depth features, but no quantitative ablation or feature-space analysis (e.g., cosine similarity of depth features under synthetic camera perturbations) is provided to verify that it successfully separates camera and depth representations rather than allowing domain-specific correlations to persist.
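One way to realize the controlled experiment requested in the first major comment is to crop and resize a single image, which changes the effective focal length and principal point while the scene and its metric depth stay fixed. The sketch below tracks how pinhole intrinsics change under such a crop-and-resize. The `Intrinsics` type and function names are hypothetical, half-pixel offset conventions are ignored, and this is not the authors' protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    fx: float
    fy: float
    cx: float
    cy: float


def crop_and_resize_intrinsics(k, x0, y0, crop_w, crop_h, out_w, out_h):
    """Intrinsics after cropping the window (x0, y0, crop_w, crop_h) and resizing it."""
    sx = out_w / crop_w
    sy = out_h / crop_h
    return Intrinsics(k.fx * sx, k.fy * sy, (k.cx - x0) * sx, (k.cy - y0) * sy)


def project(k, point):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


# Hypothetical 640x480 camera; a centered half-size crop doubles the focal length.
original = Intrinsics(500.0, 500.0, 320.0, 240.0)
zoomed = crop_and_resize_intrinsics(original, 160, 120, 320, 240, 640, 480)
# An off-center crop of the same size also shifts the principal point.
shifted = crop_and_resize_intrinsics(original, 100, 120, 320, 240, 640, 480)
```

Because the ground-truth depth of every retained pixel is unchanged by this operation, a model that truly reads camera geometry should report the same metric depth for `original`, `zoomed`, and `shifted` inputs, while its predicted camera representation should follow the modified intrinsics.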

minor comments (2)

[Figure 4] The uncertainty visualization would be clearer with an explicit color scale and comparison to ground-truth error maps.
[§3.1] Notation: The pseudo-spherical representation is introduced without a compact mathematical definition (e.g., an equation relating spherical coordinates to metric depth and camera parameters); adding this would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions that directly strengthen the claims on camera disentanglement, data integrity, and loss validation.

Referee: [§3.2] Self-promptable camera module: The module is described as predicting a dense camera representation to condition depth features, but the manuscript lacks controlled experiments (e.g., varying focal length or principal point while holding scene content fixed) that would demonstrate whether predictions rely on true camera intrinsics rather than image appearance cues such as object scale or texture. This directly bears on the zero-shot generalization claim.

Authors: We agree that explicit controlled experiments would provide clearer evidence that the self-promptable camera module captures true intrinsics rather than appearance cues. In the revised manuscript we will add a dedicated experiment subsection that fixes scene content and synthetically varies focal length and principal point; we will report the resulting shifts in the predicted dense camera representation and the downstream metric depth to demonstrate responsiveness to camera parameters.
revision: yes

Referee: [§4] Experimental setup and Table 1/2: The zero-shot evaluations across ten datasets report superior performance, yet the text does not explicitly confirm that training data splits exclude any overlap with evaluation domains or that baselines were re-trained under identical conditions without implicit domain cues. This information is load-bearing for the generalization superiority assertion.

Authors: We confirm that the training splits were constructed with no overlap to any evaluation domains and that all baselines were re-trained from scratch under identical data, hyperparameters, and protocol. In the revised Section 4 we will add an explicit paragraph detailing the split construction procedure and training conditions to make this information unambiguous and to reinforce the zero-shot generalization results.
revision: yes

Referee: [§3.3] Geometric invariance loss: The loss is introduced to enforce invariance of camera-prompted depth features, but no quantitative ablation or feature-space analysis (e.g., cosine similarity of depth features under synthetic camera perturbations) is provided to verify that it successfully separates camera and depth representations rather than allowing domain-specific correlations to persist.

Authors: We acknowledge that a quantitative feature-space analysis would strengthen the justification for the geometric invariance loss. In the revised manuscript we will include a new ablation that reports average cosine similarity of depth features under controlled synthetic camera perturbations, together with the corresponding depth accuracy metrics, to demonstrate that the loss successfully promotes separation of camera and depth representations.
revision: yes
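The feature-space check discussed in the last exchange could be computed along the following lines. This is a minimal sketch that averages the cosine similarity between depth feature vectors taken at corresponding locations of an image and its camera-perturbed copy. The features are hypothetical lists of floats, and how they would be extracted from the network and aligned across views is left open.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_feature_similarity(features_a, features_b):
    """Average cosine similarity over corresponding feature vectors."""
    if not features_a or len(features_a) != len(features_b):
        raise ValueError("feature sets must be non-empty and the same size")
    sims = [cosine_similarity(a, b) for a, b in zip(features_a, features_b)]
    return sum(sims) / len(sims)


# Hypothetical features: the first location is unchanged by the perturbation,
# the second becomes orthogonal to its original.
original_view = [[1.0, 0.0], [1.0, 1.0]]
perturbed_view = [[1.0, 0.0], [1.0, -1.0]]
score = mean_feature_similarity(original_view, perturbed_view)
```

A score near 1 with the invariance loss and a clearly lower score without it would support the claim that the loss promotes camera-invariant depth features. Reporting it next to depth accuracy, as the authors propose, guards against trivially invariant but uninformative features.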

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical ML architecture for monocular metric depth estimation, with innovations including a self-promptable camera module, pseudo-spherical output representation, geometric invariance loss, and edge-guided loss. These are introduced as design choices to promote disentanglement of camera and depth features, supported by zero-shot evaluations on ten external benchmark datasets. No equations, predictions, or first-principles results reduce by construction to quantities fitted on evaluation data or to unverified self-citations. Self-reference to the predecessor UniDepth model describes incremental improvements rather than serving as load-bearing justification for the core claims. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that metric depth is recoverable from image appearance alone across domains and that the introduced modules and losses achieve disentanglement without additional supervision.

free parameters (1)

training hyperparameters and loss weights
Standard neural network training choices tuned to achieve reported performance.

axioms (1)

domain assumption: Single images contain sufficient cues for metric depth and camera parameters across arbitrary domains.
Core premise of universal monocular metric depth estimation invoked throughout the abstract.

invented entities (2)

self-promptable camera module (no independent evidence)
purpose: Predicts dense camera representation to condition depth features
New architectural component introduced to enable domain-agnostic inference.

pseudo-spherical output representation (no independent evidence)
purpose: Disentangles camera and depth representations
New output format proposed to separate the two factors.

pith-pipeline@v0.9.0 · 5586 in / 1207 out tokens · 75025 ms · 2026-05-17T09:05:20.568310+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models
cs.CV 2026-06 unverdicted novelty 7.0

Introduces MultiDepth-3k benchmark revealing diverse layer preferences across depth models on ambiguous scenes, with Laplacian Visual Prompting altering outputs for some frozen models and best pair reaching 75.5% ML-SRA.
DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
cs.CV 2026-06 unverdicted novelty 7.0

DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspectiv...
Honey, I Shrunk the Arc de Triomphe!
cs.CV 2026-06 unverdicted novelty 7.0

MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark p...
Honey, I Shrunk the Arc de Triomphe!
cs.CV 2026-06 unverdicted novelty 7.0

Introduces MetricScenes dataset with metric grounding from geo-tags and stereo, plus Poisson depth completion, showing fine-tuned MoGe-2 reduces scale-collapse in open scenes.
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
cs.CV 2026-05 unverdicted novelty 7.0

SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredD...
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
cs.CV 2026-05 conditional novelty 7.0

CARD is a new multi-modal driving dataset delivering ~500K dense depth pixels per frame from challenging road topographies using stereo cameras and fused LiDARs over 110 km.
Anny-Fit: All-Age Human Mesh Recovery
cs.CV 2026-05 unverdicted novelty 7.0

Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
cs.CV 2026-05 unverdicted novelty 7.0

Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
cs.CV 2026-04 unverdicted novelty 7.0

A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0

ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-travers...
Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
cs.CV 2026-03 unverdicted novelty 7.0

Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.
RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
cs.CV 2026-02 unverdicted novelty 7.0

RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes whi...
AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World
cs.CV 2026-06 unverdicted novelty 6.0

AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
VLM3: Vision Language Models Are Native 3D Learners
cs.CV 2026-05 unverdicted novelty 6.0

Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
Unified Panoramic Geometry Estimation via Multi-View Foundation Models
cs.CV 2026-05 unverdicted novelty 6.0

PaGeR is a framework that lifts perspective 3D foundation models to omnidirectional images through mixed training, enabling unified prediction of scale-invariant depth, metric depth, surface normals, and sky masks fro...
HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
cs.CV 2026-05 unverdicted novelty 6.0

HumanSplatHMR jointly refines 3D human poses and learns Gaussian Splatting avatars by backpropagating photometric, segmentation, and depth losses through a differentiable renderer to improve novel-view and novel-pose ...
HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
cs.CV 2026-05 unverdicted novelty 6.0

HumanSplatHMR closes the loop between human mesh recovery and Gaussian Splatting by using photometric, segmentation, and depth losses to refine poses during avatar optimization.
Vista4D: Video Reshooting with 4D Point Clouds
cs.CV 2026-04 unverdicted novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
cs.RO 2026-04 unverdicted novelty 6.0

A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth.
In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
cs.CV 2026-04 unverdicted novelty 6.0

A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
cs.CV 2026-02 unverdicted novelty 6.0

OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
cs.CV 2025-12 conditional novelty 6.0

SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV 2025-11 unverdicted novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
SCOPE: Scale-Consistent One-Pass Estimation of 3D Geometry
cs.CV 2026-06 unverdicted novelty 5.0

SCOPE uses affine-invariant 3D point maps with shared parameters and three consistency innovations to estimate 3D geometry from extended monocular videos, reporting 24.2% and 34.9% error reductions on ScanNet.
αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion
cs.CV 2026-05 unverdicted novelty 5.0

αDepth proposes a single-pass layered model with CAR for soft boundary decomposition to improve stereo conversion by estimating layered color and depth.
Towards Consistent Video Geometry Estimation
cs.CV 2026-05 unverdicted novelty 5.0

ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
The Midas Touch for Metric Depth
cs.CV 2026-05 unverdicted novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
cs.CV 2025-08 unverdicted novelty 5.0

ROVR is a new diverse depth dataset for autonomous driving with 200K frames, released pipelines, and ablations showing sparse ground truth supports model training.
ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV 2025-08 unverdicted novelty 5.0

ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
cs.CV 2025-07 unverdicted novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
Can Single-View Mesh Reconstruction Generalize to Robot Camera Rotation?
cs.CV 2026-06 unverdicted novelty 4.0

Single-view mesh reconstruction generalizes poorly to robot camera rotations, inducing MDE distortion and layout drift, while a gravity-aware refinement cuts one-stage layout-orientation error by 47.1%.
Large Depth Completion Model from Sparse Observations
cs.CV 2026-05 unverdicted novelty 4.0

LDCM achieves state-of-the-art metric depth completion from sparse observations by combining foundation-model initialization with a point-map regression head that removes the need for camera intrinsics.
Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
cs.CV 2026-04 unverdicted novelty 3.0

Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.


work page 2021
[27]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9492–9502. 2

work page 2024
[28]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 371–10 381. 2

work page 2024
[29]

Depth Anything V2

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv preprint arXiv:2406.09414, 2024. 2, 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” 2023. 3

work page 2023
[31]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024
[32]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction,

J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” inInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025
[33]

Video depth anything: Consistent depth estimation for super-long videos,

S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, “Video depth anything: Consistent depth estimation for super-long videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 22 831–22 840. 3

work page 2025
[34]

Depthcrafter: Generating consistent long depth sequences for open-world videos,

W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y . Zhang, L. Quan, and Y . Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 2005–2015. 3

work page 2025
[35]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,”Proceedings of the International Conference on 3D Vision (3DV), pp. 239–248, 6

work page
[36]

Learning depth from single monoc- ular images using deep convolutional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monoc- ular images using deep convolutional neural fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 38, pp. 2024–2039, 2 2015. 3

work page 2024
[37]

Transformer-based attention networks for continuous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,”Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16 249–16 259, 3 2021. 3

work page 2021
[38]

Video depth propagation,

L. Piccinelli, T. Wandel, C. Sakaridis, W. Abbeloos, and L. V . Gool, “Video depth propagation,” 2025. [Online]. Available: https: //arxiv.org/abs/2512.10725 3

work page arXiv 2025
[39]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M¨uller, “Zoedepth: Zero- shot transfer by combining relative and metric depth,”arXiv preprint arXiv:2302.12288, 2023. 3, 7, 8, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Cam-convs: Camera-aware multi-scale convolutions for single- view depth,

J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions for single- view depth,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11 826–11 835. 3

work page 2019
[41]

arXiv preprint arXiv:1907.10326 , year=

J. H. Lee, M. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi- scale local planar guidance for monocular depth estimation,”CoRR, vol. abs/1907.10326, 7 2019. 3, 9

work page arXiv 1907
[42]

Mapillary planet-scale depth dataset,

M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bul `o, Y . Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” inThe European Conference on Computer Vision (ECCV). Springer International Pub- lishing, 2020, pp. 589–604. 3, 6

work page 2020
[43]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y . Zhou, S. R. Richter, and V . Koltun, “Depth pro: Sharp monocular metric depth in less than a second,”arXiv preprint arXiv:2410.02073, 2024. 3, 6, 7, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli, C. Sakaridis, M. Segu, Y .-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “UniK3D: Universal camera monocular 3d estimation,” 13 inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025
[45]

Diffusion models for monocular depth estimation: Overcoming challenging conditions,

F. Tosi, P. Zama Ramirez, and M. Poggi, “Diffusion models for monocular depth estimation: Overcoming challenging conditions,” inEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024
[46]

Robust monocular depth estimation under chal- lenging conditions,

S. Gasperini, V . Olsson, M. Poggi, F. Tosi, S. Salti, L. Di Stefano, K. AAstr ”om, J. Gonfaus, L. Van Gool, R. Timofte, A. N ”aslund, and L. Bitti, “Robust monocular depth estimation under chal- lenging conditions,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), October 2023, pp. 7897–7908. 3

work page 2023
[47]

Learning depth estimation for transparent and mirror surfaces,

A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Ste- fano, “Learning depth estimation for transparent and mirror surfaces,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 17 770–17 780. 3

work page 2023
[48]

Booster: A benchmark for depth from images of specular and transparent surfaces,

P. Z. Ramirez, A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Booster: A benchmark for depth from images of specular and transparent surfaces,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 85–102, 2024. 3, 6

work page 2024
[49]

Slic superpixels compared to state-of-the-art superpixel methods,

R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. S ¨usstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012. 5

work page 2012
[50]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un- terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR). OpenReview.net, 2021. 5, 7

work page 2021
[51]

Grounding image matching in 3d with mast3r, 2024

V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,”arXiv preprint arXiv:2406.09756, 2024. 7, 8, 10

work page arXiv 2024
[52]

A2d2: Audi autonomous driving dataset , year =

J. Geyer, Y . Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V . H. Pham, M. M ¨uhlegg, S. Dorn, T. Fernandez, M. J ¨anicke, S. Mirashi, C. Savani, M. Sturm, O. V orobiov, M. Oelker, S. Garreis, and P. Schuberth, “A2D2: Audi Autonomous Driving Dataset,”arXiv preprint arXiv:2004.06320, 2020. [Online]. Available: https://www.a2d2.audi 6

work page arXiv 2004
[53]

Argoverse 2: Next generation datasets for self-driving perception and forecasting,

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays, “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” inAdvances in Neural Information Processing Systems,

work page
[54]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB- d data,

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB- d data,” inAdvances in Neural Information Processing Systems (NIPS),

work page
[55]

BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion,

M. J. Black, P. Patel, J. Tesch, and J. Yang, “BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 8726–8737. 6

work page 2023
[56]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,

Y . Yao, Z. Luo, S. Li, J. Zhang, Y . Ren, L. Zhou, T. Fang, and L. Quan, “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1790–1799. 6

work page 2020
[57]

DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision,

L. Ling, Y . Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y . Luet al., “DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 160– 22 169. 6

work page 2024
[58]

Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,

G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 6

work page 2019
[59]

Dynamicstereo: Consistent dynamic depth from stereo videos,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rup- precht, “Dynamicstereo: Consistent dynamic depth from stereo videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 6

work page 2023
[60]

Eden: Multimodal synthetic dataset of enclosed garden scenes,

H.-A. Le, T. Mensink, P. Das, S. Karaoglu, and T. Gevers, “Eden: Multimodal synthetic dataset of enclosed garden scenes,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1579–1589. 6

work page 2021
[61]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction,

Y . Liu, Y . Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, “Hoi4d: A 4d egocentric dataset for category-level human-object interaction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21 013– 21 022. 6

work page 2022
[62]

Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y . Zhao, and D. Batra, “Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,” inAdvances in Neural Information Processing Systems (NIPS), 2021. 6

work page 2021
[63]

Matterport3d: Learning from rgb-d data in indoor environments,

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y . Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” inProceedings of the International Conference on 3D Vision (3DV), 2017. 6

work page 2017
[64]

Ma- trixcity: A large-scale city dataset for city-scale neural rendering and beyond,

Y . Li, L. Jiang, L. Xu, Y . Xiangli, Z. Wang, D. Lin, and B. Dai, “Ma- trixcity: A large-scale city dataset for city-scale neural rendering and beyond,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3205–3215. 6

work page 2023
[65]

Megadepth: Learning single-view depth prediction from internet photos,

Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2041–2050. 6

work page 2018
[66]

Map-free visual relocalization: Metric pose relative to a single image,

E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, ´A. Monszpart, V . A. Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” inEuropean Conference on Computer Vision (ECCV), 2022. 6

work page 2022
[67]

Pointodyssey: A large-scale synthetic dataset for long-term point track- ing,

Y . Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas, “Pointodyssey: A large-scale synthetic dataset for long-term point track- ing,” inProceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), 2023, pp. 19 855–19 865. 6

work page 2023
[68]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6

work page 2017
[69]

Scannet++: A high- fidelity dataset of 3d indoor scenes,

C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high- fidelity dataset of 3d indoor scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 6

work page 2023
[70]

Tartanair: A dataset to push the limits of visual slam,

W. Wang, D. Zhu, X. Wang, Y . Hu, Y . Qiu, C. Wang, Y . Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916. 6

work page 2020
[71]

Taskonomy: Disentangling task transfer learning,

A. R. Zamir, A. Sax, W. B. Shen, L. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 6

work page 2018
[72]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caineet al., “Scalability in perception for autonomous driving: Waymo open dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2446–2454. 6

work page 2020
[73]

Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,

H. Xia, Y . Fu, S. Liu, and X. Wang, “Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 378–22 389. 6

work page 2024
[74]

Sun rgb-d: A rgb-d scene un- derstanding benchmark suite,

S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene un- derstanding benchmark suite,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June- 2015, pp. 567–576, 10 2015. 6

work page 2015
[75]

Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset,

T. Koch, L. Liebel, M. K ¨orner, and F. Fraundorfer, “Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset,”Computer Vision and Image Understanding (CVIU), vol. 191, p. 102877, 2020. 6

work page 2020
[76]

A benchmark for the evaluation of rgb-d slam systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” inProc. of the International Conference on Intelligent Robot Systems (IROS), 2012. 6

work page 2012
[77]

A multi-view stereo benchmark with high-resolution images and multi-camera videos,

T. Sch¨ops, J. L. Sch¨onberger, S. Galliani, T. Sattler, K. Schindler, M. Polle- feys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), 2017. 6

work page 2017
[78]

A naturalistic open source movie for optical flow evaluation,

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” inThe European Conference on Computer Vision (ECCV), ser. Part IV , LNCS 7577. Springer, 2012, pp. 611–625. 6

work page 2012
[79]

3d packing for self-supervised monocular depth estimation,

V . Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6

work page 2020
[80]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krish- nan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6 14

work page 2020

[14] [14]

Metric3d: Towards zero-shot metric 3d prediction from a single image,

W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, “Metric3d: Towards zero-shot metric 3d prediction from a single image,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9043–9053. 1, 3, 5, 7, 8, 9, 10

work page 2023

[15] [15]

Towards zero-shot scale-aware monocular depth estimation,

V . Guizilini, I. Vasiljevic, D. Chen, R. Ambrus,, and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9233–9243. 1, 3

work page 2023

[16] [16]

arXiv preprint arXiv:2404.15506 , year=

M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen, “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estima- tion,”arXiv preprint arXiv:2404.15506, 2024. 1, 3, 6, 7, 8, 9, 10

work page arXiv 2024

[17] [17]

Unidepth: Universal monocular metric depth estimation,

L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Segu, S. Li, L. Van Gool, and F. Yu, “Unidepth: Universal monocular metric depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 106–10 116. 1, 2, 3, 6, 7, 8, 9, 10, 11

work page 2024

[18] [18]

Perception-aware chance-constrained model predictive control for uncertain environments,

A. D. Bonzanini, A. Mesbah, and S. Di Cairano, “Perception-aware chance-constrained model predictive control for uncertain environments,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 2082–

work page 2021

[19] [19]

Stochastic model predictive control: An overview and perspectives for future research,

A. Mesbah, “Stochastic model predictive control: An overview and perspectives for future research,”IEEE Control Systems Magazine, vol. 36, no. 6, pp. 30–44, 2016. 2

work page 2016

[20] [20]

Safe perception-based control under stochastic sensor uncertainty using con- formal prediction,

S. Yang, G. J. Pappas, R. Mangharam, and L. Lindemann, “Safe perception-based control under stochastic sensor uncertainty using con- formal prediction,”arXiv preprint arXiv:2304.00194, 2023. 2

work page arXiv 2023

[21] [21]

Robust model predictive control: A survey,

A. Bemporad and M. Morari, “Robust model predictive control: A survey,” inRobustness in identification and control. Springer, 2007, pp. 207–

work page 2007

[22] [22]

Indoor seg- mentation and support inference from rgbd images,

P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor seg- mentation and support inference from rgbd images,” inThe European Conference on Computer Vision (ECCV), 2012. 2

work page 2012

[23] [23]

Are we ready for autonomous driving? The KITTI vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2

work page 2012

[24] [24]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,

R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,”IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 44, no. 3, pp. 1623–1637, 2020. 2

work page 2020

[25] [25]

Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,

A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10 786–10 796. 2

work page 2021

[26] [26]

Learning to recover 3d scene shape from a single image,

W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, “Learning to recover 3d scene shape from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 204–213. 2

work page 2021

[27] [27]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9492–9502. 2

work page 2024

[28] [28]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 371–10 381. 2

work page 2024

[29] [29]

Depth Anything V2

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv preprint arXiv:2406.09414, 2024. 2, 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” 2023. 3

work page 2023

[31] [31]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024

[32] [32]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction,

J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” inInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025

[33] [33]

Video depth anything: Consistent depth estimation for super-long videos,

S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, “Video depth anything: Consistent depth estimation for super-long videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 22 831–22 840. 3

work page 2025

[34] [34]

Depthcrafter: Generating consistent long depth sequences for open-world videos,

W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y . Zhang, L. Quan, and Y . Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 2005–2015. 3

work page 2025

[35] [35]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,”Proceedings of the International Conference on 3D Vision (3DV), pp. 239–248, 6

work page

[36] [36]

Learning depth from single monoc- ular images using deep convolutional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monoc- ular images using deep convolutional neural fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 38, pp. 2024–2039, 2 2015. 3

work page 2024

[37] [37]

Transformer-based attention networks for continuous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16 249–16 259, 3 2021. 3

[38]

Video depth propagation,

L. Piccinelli, T. Wandel, C. Sakaridis, W. Abbeloos, and L. V. Gool, “Video depth propagation,” 2025. [Online]. Available: https://arxiv.org/abs/2512.10725 3

[39]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023. 3, 7, 8, 9, 10

[40]

Cam-convs: Camera-aware multi-scale convolutions for single-view depth,

J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions for single-view depth,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11 826–11 835. 3

[41]

From big to small: Multi-scale local planar guidance for monocular depth estimation,

J. H. Lee, M. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” CoRR, vol. abs/1907.10326, 7 2019. 3, 9

[42]

Mapillary planet-scale depth dataset,

M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulò, Y. Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” in The European Conference on Computer Vision (ECCV). Springer International Publishing, 2020, pp. 589–604. 3, 6

[43]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” arXiv preprint arXiv:2410.02073, 2024. 3, 6, 7, 8, 10

[44]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “UniK3D: Universal camera monocular 3d estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

[45]

Diffusion models for monocular depth estimation: Overcoming challenging conditions,

F. Tosi, P. Zama Ramirez, and M. Poggi, “Diffusion models for monocular depth estimation: Overcoming challenging conditions,” in European Conference on Computer Vision (ECCV), 2024. 3

[46]

Robust monocular depth estimation under challenging conditions,

S. Gasperini, V. Olsson, M. Poggi, F. Tosi, S. Salti, L. Di Stefano, K. Åström, J. Gonfaus, L. Van Gool, R. Timofte, A. Näslund, and L. Bitti, “Robust monocular depth estimation under challenging conditions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 7897–7908. 3

[47]

Learning depth estimation for transparent and mirror surfaces,

A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Learning depth estimation for transparent and mirror surfaces,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 17 770–17 780. 3

[48]

Booster: A benchmark for depth from images of specular and transparent surfaces,

P. Z. Ramirez, A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Booster: A benchmark for depth from images of specular and transparent surfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 85–102, 2024. 3, 6

[49]

Slic superpixels compared to state-of-the-art superpixel methods,

R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012. 5

[50]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR). OpenReview.net, 2021. 5, 7

[51]

Grounding image matching in 3d with mast3r,

V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” arXiv preprint arXiv:2406.09756, 2024. 7, 8, 10

[52]

A2D2: Audi Autonomous Driving Dataset,

J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, M. Oelker, S. Garreis, and P. Schuberth, “A2D2: Audi Autonomous Driving Dataset,” arXiv preprint arXiv:2004.06320, 2020. [Online]. Available: https://www.a2d2.audi 6

[53]

Argoverse 2: Next generation datasets for self-driving perception and forecasting,

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays, “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” in Advances in Neural Information Processing Systems,

[54]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” in Advances in Neural Information Processing Systems (NIPS),

[55]

BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion,

M. J. Black, P. Patel, J. Tesch, and J. Yang, “BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 8726–8737. 6

[56]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,

Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan, “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1790–1799. 6

[57]

DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision,

L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., “DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 160–22 169. 6

[58]

Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,

G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 6

[59]

Dynamicstereo: Consistent dynamic depth from stereo videos,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Dynamicstereo: Consistent dynamic depth from stereo videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 6

[60]

Eden: Multimodal synthetic dataset of enclosed garden scenes,

H.-A. Le, T. Mensink, P. Das, S. Karaoglu, and T. Gevers, “Eden: Multimodal synthetic dataset of enclosed garden scenes,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1579–1589. 6

[61]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction,

Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, “Hoi4d: A 4d egocentric dataset for category-level human-object interaction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21 013–21 022. 6

[62]

Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,” in Advances in Neural Information Processing Systems (NIPS), 2021. 6

[63]

Matterport3d: Learning from rgb-d data in indoor environments,

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” in Proceedings of the International Conference on 3D Vision (3DV), 2017. 6

[64]

Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,

Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai, “Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3205–3215. 6

[65]

Megadepth: Learning single-view depth prediction from internet photos,

Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2041–2050. 6

[66]

Map-free visual relocalization: Metric pose relative to a single image,

E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, Á. Monszpart, V. A. Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” in European Conference on Computer Vision (ECCV), 2022. 6

[67]

Pointodyssey: A large-scale synthetic dataset for long-term point tracking,

Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas, “Pointodyssey: A large-scale synthetic dataset for long-term point tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 19 855–19 865. 6

[68]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6

[69]

Scannet++: A high-fidelity dataset of 3d indoor scenes,

C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 6

[70]

Tartanair: A dataset to push the limits of visual slam,

W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916. 6

[71]

Taskonomy: Disentangling task transfer learning,

A. R. Zamir, A. Sax, W. B. Shen, L. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 6

[72]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2446–2454. 6

[73]

Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,

H. Xia, Y. Fu, S. Liu, and X. Wang, “Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 378–22 389. 6

[74]

Sun rgb-d: A rgb-d scene understanding benchmark suite,

S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June-2015, pp. 567–576, 10 2015. 6

[75]

Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset,

T. Koch, L. Liebel, M. Körner, and F. Fraundorfer, “Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset,” Computer Vision and Image Understanding (CVIU), vol. 191, p. 102877, 2020. 6

[76]

A benchmark for the evaluation of rgb-d slam systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012. 6

[77]

A multi-view stereo benchmark with high-resolution images and multi-camera videos,

T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6

[78]

A naturalistic open source movie for optical flow evaluation,

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in The European Conference on Computer Vision (ECCV), ser. Part IV, LNCS 7577. Springer, 2012, pp. 611–625. 6

[79]

3d packing for self-supervised monocular depth estimation,

V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6

[80]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 6
